Arrays and Lists in SQL Server
Data from Performance Tests in SQL 2008

An SQL text by Erland Sommarskog, SQL Server MVP Latest revision 2012-07-01. Copyright applies to this text.

Introduction

This is an appendix to the articles Arrays and Lists in SQL Server 2005 and Beyond and Arrays and Lists in SQL Server 2008 (Using Table-Valued Parameters). In this appendix, I present data and analysis from the suite of performance tests I ran in 2009 and 2010 on SQL 2008 for the various list-to-table methods, including table-valued parameters. There is another appendix that with data from tests that I ran in 2006 on SQL 2005. If you have arrived at this page from a web search, you may want to at least browse the main articles first.

The main reason to run this set of tests was to capture the performance of new methods: table-valued parameters, Itzik Ben-Gan's clever fn_nums, Adam Machanic's new CLR function and the primitive method of passing many parameters. This time I also made the effort to measure call overhead and a feeble attempt to measure multi-thread performance.

This appendix is differently organised from the first. You will not see much numbers within the article itself, but instead I will refer you to big digit dumps in form of Excel books with the data, and the article focuses on analysis. The Excel books are saved in the old Excel 2003 format, however they display better in Excel 2007.

I first list all tested methods whereupon I describe the test setup for the main, single-threaded, tests. I then proceed with making various observations of the outcome of these tests. I continue with discussing my results from testing call overhead, and finally I turn to the multi-thread test, describing how it was set up, and observations I have made from it.

Table of Contents

   General Disclaimers
   The Contenders
   The Test Setup
      Used Hardware
      The Test Table
      The Test Script
      The Input Data
      Tested Operations and Sample Test Procedures
      Measuring Time
   Single-threaded Results
      Overview of single-summary.xls
      Query Plans
      Table-Valued Parameters
      CLR
      XML
      Looking at fn_nums
      JOIN vs. EXISTS and the Others
      The MAX Threshold
      Comparing the Original Test with the Rerun
   Call Overhead
      Methodology
      Clientsummary.xls
      Table-valued Parameters
      Passing Many Parameters
      XML
      Plain Strings
   Multi-threaded Tests
      Test Setup
      What to Measure
      Tested Methods and Operations
      Hardware etc.
      multithreadsummary.xls
      General Pattern
      Winners...
      ... and Losers
      Parallelism?
   Conclusion
   Revisions

General Disclaimers

These tests are fairly simple in nature. I have run the tests on an idle machine, where I've stopped as many services I've dared, in order to minimise disturbing activity in the machine. The test table has been small enough to fit entirely into memory on the test box. The test machine itself is a laptop, not a server-class machine you would use for a production system.

This means that these while these tests give us some information, there are also pieces missing. The numbers should be informative enough to tell us how fast or slow the methods as such are. But they do not tell us how they will work with your tables, and nor do they say much about how they will work under load. Indeed, I did run multi-threaded tests this time, but the results of those tests are fairly inconclusive, not the least because of the poor choice of hardware.

Thus, if you need top performance, you need to conduct your own tests with your tables, preferably on a production-class machine. You should decide whether you want parallelism or not and make sure you turn of parallelism if this is not desired. You should also conduct your tests under load, for instance running tests with parallel clients hammering the server.

I like to stress that you cannot use the results in this appendix to compare with my previous tests for SQL 2005 or SQL 2000. There are differences in the test setup, and I have used different hardware this time.

All stored procedures, user-defined functions and the test script are available through links in this appendix. Please contact me, if you also want the test data. This is 50 MB in zipped format, which explains why I have not made it available for download.

A quick history: I originally ran these tests in 2009, but in 2010 I became aware of several important things with XML, and I felt compelled to rerun the tests. This article looks mainly at the results from 2009, since I originally wrote the article from those test results. I make reference to the results from 2010 where this is relevant. I did not rerun the multi-threaded tests in 2010.

The Contenders

For this suite of tests, I did not include the really slow methods and the poorly performing unpack_with_union and unpack_with_insert. I also excluded minor variations that I had in the tests in 2006 to test some special tweak. As with the old tests, I've have run tests for more methods that I present here, but I've excluded methods that performed less well, or were just a variation of some other method. Still there are 34 different methods, falling into 12 groups.

I have given each method that I've tested a name, and I will use these names as a convenient short-hand notation. Throughout the appendix, these names appear as links and if you click the link, you will be brought to a page with the test procedures for the method in questions, and these pages in their turn have links to the tested functions. Thus, in this way you can see exactly what I have tested.

Here are the 34 methods, listed in alphabetic order, and with a thicker line separating different groups of methods.

CLR$ADAM	Adam Machanic's CLR function that uses a char array and where he avoids IndexOf. For the tests of integer lists, I added a function that returns int, so that the conversion to integer is performed in C#.
CLR$FIX	Fixed-length format, implemented with the CLR.
CLR$ITER	Rolling our own in the CLR.
CLR$SPLIT	CLR using the split method.
CTE$CHUNK	A chunking procedure using a recursive CTE.
CTE$IL	cte_split_inline in the main article.
CTE$MSTMT	A multi-statement function with the same contents as cte_split_inline.
EXEC$A	Dynamic SQL.
FIX$BINARY	fixbinary_single
FIX$MSTMT	A multi-statement version of fixstring_single.
FIX$MULTI	fixstring_multi.
FIX$MULTI2	fixstring_multi2.
FIX$SINGLE	fixstring_single.
FNFIX$BINARY	fn_fixbinary_single, a version of fixbinary_single that uses fn_nums.
FNFIX$MSTMT	A multi-statement version of fixstring_single that uses fn_nums.
FNFIX$SINGLE	fn_fixstring_single, a version of fixstring_single that uses fn_nums.
FNNUM$CHUNK	fn_chunk_split_me, a version of chunk_split_me that uses fn_nums.
FNNUM$IL	fn_inline_split_me, a version of inline_split_me that uses fn_nums.
FNNUM$MSTMT	A multi-statement version of FNNUM$IL.
ITER$CHUNK	The iterative method with chunks.
ITER$SIMPLE	The iterative method without chunking.
MANYPARAM	Passing many parameters.
MANYSELECT	Making the list into many SELECT statements.
TBLNUM$CHUNK	chunk_split_me, same as duo_chunk_split_me, but with a single return column
TBLNUM$IL	inline_split_me.
TBLNUM$MSTMT	mstmt_split_me, a multi-statement version of inline_split_me.
TVP$NOPK	A table-valued parameter without a primary key in the definition.
TVP$PK	A table-valued parameter with a primary key.
XMLATTR	Attribute-centric untyped XML.
XMLATTR$SCH	Attribute-centric XML constrained by an XML schema.
XMLATTRPOS	Returning the list position from an XML document by using the Numbers table.
XMLELEM	Element-centric untyped XML using `value('.')` to address the elements.
XMLELEM$SCH	Element-centric XML constrained by an XML schema
XMLELEM$TEXT	Element-centric untyped XML using `value('(./text())[1]')` to address the elements

New from the tests in 2006 are CLR$ADAM, all FNFIX and FNNUM methods, MANYPARAM, TBLNUM$MSTMT, TVP$NOPK, TVP$PK, XMLATTRPOS. TBLNUM$CHUNK was called TBLNUM in those tests. TBLNUM$MSTMT was probably among the methods I opted to exclude in 2006, but it happened to be included this time. New are also XMLATTRPOS, XMLATTR$SCH, XMLELEM$SCH and XMLELEM$TEXT. Note that the last three were only included in the rerun in 2010.

The methods that have been retained from the old test suite in 2006 are mainly unchanged. The difference is the methods based on inline_split_me and cte_split_inline where I've applied the workaround for the caching bug.

There are differences to all XML methods between the 2009 tests and the rerun in 2010. There are also differences to the FNFIX and FNNUM methods, as I used different implementation of fn_nums in 2010.

The Test Setup

I here describe the test setup. It is similar to the 2006 tests, but there are also several differences and I note these differences as far it is relevant. The description here focuses on the single-threaded tests. The multi-threaded tests were set up somewhat differently, and I describe this setup later.

Used Hardware

I ran all tests on my laptop, which has a dual-core CPU, running at 2.2 GHz. The machine has 4 GB of memory and runs Vista Business x64 SP2. For the single-threaded tests I also ran the test script on this machine. The laptop has only one disk, so data and log files for the test database and tempdb are all on the same disk.

The tests in 2009 were run on SQL 2008 SP1. In the tests 2010, I had upgraded the instance to a CTP version of SQL 2008 SP2. In 2010, I ran the tests with trace flag 4130 enabled.

This time I did not find it worthwhile to run the tests on several machines. Unfortunately, I did not have access to server-class hardware.

The Test Table

I used the same test table as in the tests in 2006:

CREATE TABLE usrdictwords
  (wordno int          NOT NULL,
   word   nvarchar(50) NOT NULL,
   guid   char(36)     NOT NULL)
CREATE UNIQUE CLUSTERED INDEX wordno_ix ON usrdictwords(wordno)
CREATE UNIQUE INDEX word_ix ON usrdictwords(word)

The table contains 1 163 750 rows, and the average length of the column word is 10.6 characters. The longest word is 30 characters. wordno holds numbers in the range from 0 to 1 163 749. The order of wordno is not correlated with the order of word. The column guid serves to make the table a little wider, and to be a token non-indexed column. The data in guid is not used for anything in the tests.

The collation for the database is Slovenian_CS_AS. (The data is a taken from a Slovenian version of the Unix file /usr/dict/words, used for the spell utility, that I culled off the net somewhere.)

The Test Script

The test script is written in Perl, connecting to SQL Server using the SQL Server Native Client OLE DB provider.

The script creates the database with a size of 195 MB for the data file and 100 MB for the log file (the initial size is 750 MB during the data load, but the script shrinks it before running the tests). The database is in simple recovery. tempdb is not controlled by the script, but I made sure that tempdb had at least 200 MB for data and 50 MB for the log.

The test script runs with the default SET options you get with OLE DB, except that it also issues SET NOCOUNT ON.

When running the tests, the test script creates lists of strings and integers as described below, and passes the assembled lists to all test procedures currently in the test database in alphabetic order. That is, all methods see the same input data. For the single-threaded test, the script runs for six list lengths: 20, 200, 650, 2 000, 10 000 and 50 000 list elements. The script first runs all test procedures for lists with 20 elements 100 times (but see below), then all for 200 elements 100 times etc. All input lists have unique entries; there are no duplicates in them.

When starting on a new list length, the script performs an sp_recompile on usrdictwords to flush all plans. It then runs a test 0 for that length that is not included in the results, to avoid distortion from the cost of compilation.

Normally, a test procedure is run 100 times for a certain list length. However, since some methods are slow or excessively slow, the script stops running a procedure if its total execution time exceeds three minutes. Nevertheless, the procedure is still permitted to run three times more (including the initial test 0) on the next list length so that I can catch the development as the number of elements rises. However, if the total time for a procedure exceeds 15 minutes, this second chance is abandoned.

The script did not run MANYPARAM for 10 000 and 50 000 elements, as this method cannot support this length. I also excluded EXEC$A from running at 50 000 elements, since this failed with an internal query-processor error.

The script starts a transaction before running a test procedure and commits when execution is completed.

You can view the complete test script here.

The Input Data

For each test round, the test script randomly selects a set of unique numbers in the range from 0 to 1 163 749. These numbers constitute the input for the integer lists. The script uses these numbers retrieve the corresponding words in usrdictwords and thereby it gets the input for the string lists.

The exact format of the lists depends on the method, with the most common being a comma-separated list with no extra spacing. (As opposed to my tests for SQL 2000 and SQL 2005 where I randomly added extra spacing.) Other formats are as follows:

Methods	Data type	Format
CLR$ITER, CLR$SPLIT, ITER$CHUNK and ITER$SIMPLE	Integer	Space-separated list
EXEC$A	String	Elements are wrapped in single quotes to appease SQL syntax.
FIX$BINARY, FNFIX$BINARY	Integer	Binary format.
CLR$FIX and other FIX and FNFIX methods	Integer	Fixed length with 9 positions per number.
CLR$FIX and other FIX and FNFIX methods	String	Fixed length with 30 positions per string.
XML methods:	All	XML documents as shown in the XML section in the main article.
TVP methods and MANYPARAM	All	Data passed as the method requires. More details on table types for TVPs later.

Tested Operations and Sample Test Procedures

I extended the set of operations from the previous tests with two more, to a total of four. Here is a summary, followed by a sample of each type of procedure.

UNPACK – simply unpack the list, involving no other table. (Save Numbers for the methods that rely on this table.) This test is not possible to carry out for dynamic SQL and MANYPARAM.
JOIN – A join operation against the test table. This also includes dynamic SQL and MANYPARAM, although for them it is not a join in the true sense.
COUNT – The same query as for JOIN, but with a SUM aggregate to avoid the need to insert the retrieved data into a temp table.
EXISTS – An EXISTS query against the test table. This test is not possible to make different from JOIN for dynamic SQL and MANYPARAM. Neither is it meaningful to perform this test with XMLATTRPOS.

For most methods I've run the tests with lists of numbers and lists of strings. The exceptions are FIX$BINARY and FNFIX$BINARY, which are only meaningful with numeric input.

There is a stored procedure for each combination of method, operation, data type which means that for a single method there is up to eight test procedures. Below I show an example of test procedures for each operation and data type. To see the test procedures for a specific method, click on the link for the method in question.

UNPACK

Here is a typical procedure for the UNPACK operation:

CREATE PROCEDURE CTE$MSTMT_Int_UNPACK_test
                 @str      nvarchar(MAX),
                 @retdata  bit = 1,
                 @tookms   int = NULL OUTPUT AS

DECLARE @start datetime2(3)
SELECT @start = sysdatetime()

INSERT #Int_UNPACK (number)
SELECT number = convert(int, str)
FROM   cte_split_mstmt(@str, DEFAULT)

SELECT @tookms = datediff(ms, @start, sysdatetime());

IF @retdata = 1 SELECT number FROM #Int_UNPACK
TRUNCATE TABLE #Int_UNPACK

The procedure: 1) Accepts an nvarchar(MAX) parameter. 2) Starts a timer. 3) Issues the query, inserting the result set into a empty temp table. 4) Stops the timer. 5) Returns the data to the test script if requested. 6) Truncates the table after the run.

Notes:

The data type for the list parameter depends on the method. XML methods obviously take an xml parameter, and the procedures for TVP take a table type. The test procedures for MANYPARAM have the 2 000 parameters coming after @tookms.
I discuss the timer in a separate section below.
The test script creates the temp tables when it starts, and the temp tables exists throughout the entire run of the test. In the tests in 2006, I used permanent tables in the test database, but temp tables seem to give less overhead. The reason I use tables at all, is that returning the data to the client is likely to incur a much greater overhead. Also the performance of the client API could affect the results, as would the network if client and server are on different machines. (SELECT INTO? I tried that in the tests in 2006, but I got spurious values that deviated widely from the normal pattern. Probably that was due to autostats on the system tables. When I abandoned SELECT INTO, the issue went away.)
I use the option to return the data to the client to verify that that the method returns the expected data when running test 0 – no use testing a method that returns incorrect results!

JOIN

Here is a sample procedure of the JOIN operation for an integer list.

CREATE PROCEDURE CLR$ITER_Int_JOIN_test
                 @str      nvarchar(MAX),
                 @retdata  bit = 1,
                 @tookms   int = NULL OUTPUT AS

DECLARE @start datetime2(3)
SELECT @start = sysdatetime()

INSERT #Int_JOIN (word)
SELECT u.word
FROM   usrdictwords u
JOIN   CLR_intlist_iter(@str, DEFAULT) AS a ON u.wordno = a.number

SELECT @tookms = datediff(ms, @start, sysdatetime());

IF @retdata = 1 SELECT word FROM #Int_JOIN
TRUNCATE TABLE #Int_JOIN

The expected query plan is to retrieve the data through a Clustered Index Seek on wordno.

And here is one with a list of strings. In interest of brevity, I only show the INSERT statement:

INSERT #Str_JOIN (wordno, guid)
SELECT u.wordno, u.guid
FROM   usrdictwords u
JOIN   inline_split_me(@str) AS a ON u.word = a.Value

To make the tests more interesting, I designed the JOIN procedures for string data so that it retrieves both the columns wordno and guid. This gives the optimizer two obvious choices to access usrdictwords: 1) Do an Index Seek on the index on wordno followed by a number of key lookups on the base table. 2) Scan usrdictwords as part of a hash or merge join.

COUNT

Here is how a COUNT procedure for integer data looks like:

CREATE PROCEDURE CLR$FIX_Int_COUNT_test
                 @str nvarchar(MAX),
                 @retdata  bit = 1,
                 @tookms   int = NULL OUTPUT AS

DECLARE @start datetime2(3)
SELECT @start = sysdatetime()

DECLARE @cnt bigint
SELECT @cnt = SUM(len(word))
FROM   usrdictwords u
JOIN   CLR_intlist_fix(@str, 9) AS a ON u.wordno = a.number

SELECT @tookms = datediff(ms, @start, sysdatetime());

As far as it comes to retrieving data from the test table, the COUNT procedures are the same as the JOIN procedures. But instead of inserting the data into a temp table, I perform a SUM operation. In this way I reduce the overhead that caused by the insertion into the temp table. I particularly devised these procedures for the multi-threaded tests. While the access path to the table is the same for JOIN, it is a different operation, and the optimizer may still make different choices.

(Why do I call it COUNT when I use SUM? I first used COUNT, but then I decided to use SUM to force SQL Server to read all rows. But I never came around to change the name.)

Here is a COUNT procedure for strings:

CREATE PROCEDURE XMLATTR_Str_COUNT_test @str      xml,
                                        @retdata  bit = 1,
                                        @tookms   int = NULL OUTPUT AS

DECLARE @start datetime2(3)
SELECT @start = sysdatetime()

DECLARE @cnt bigint
SELECT @cnt = SUM(len(guid))
FROM   usrdictwords u
JOIN   @str.nodes('/Root/Word') AS T(Item)
  ON   u.word = T.Item.value('@Item', 'nvarchar(50)')

SELECT @tookms = datediff(ms, @start, sysdatetime());

As with the JOIN operation, I force SQL Server to read the data pages. (@cnt will always be 36 times number of rows, but SQL Server still has to read every row, since len() does not include trailing spaces and SQL Server does not know that I filled every char(36) to the max.)

EXISTS

Finally, here are samples of the EXISTS operation. I show the INSERT statements only:

INSERT #Int_JOIN (word)
SELECT u.word
FROM   usrdictwords u
WHERE  EXISTS (SELECT *
               FROM   inline_split_me(@str) AS a
               WHERE  u.wordno = convert(int, a.Value))

INSERT #Str_JOIN (wordno, guid)
SELECT u.wordno, u.guid
FROM   usrdictwords u
WHERE  EXISTS (SELECT *
               FROM   cte_split_chunk(@str, DEFAULT) AS a
               WHERE  u.word = a.str)

Measuring Time

As you can see from the sample procedures, I use the new system function sysdatetime() that returns datetime2 to measure execution time. While datetime2 permits a precision up to 100 ns, I only measure whole milliseconds. There is a simple reason for this. If you run a loop like:

SET NOCOUNT ON
DECLARE @i int = 1000
DECLARE @t TABLE (d2 datetime2)
WHILE @i > 0
BEGIN
   INSERT @t VALUES(sysdatetime())
   SELECT @i -= 1
END
SELECT d2, COUNT(*)
FROM   @t
GROUP  BY d2

You will see an output like this:

2009-09-27 19:12:55.8656640    1
2009-09-27 19:12:55.8666640   25
2009-09-27 19:12:55.8676640   26
2009-09-27 19:12:55.8686640   38

That is, the resolution of the values returned from sysdatetime() is only a millisecond. I played with a CLR routine that uses a .NET class which in its turn calls the routine QueryPerformanceCounter in the Windows API, and which has a resolution of 100 ns. However, I decided not rely on it. Just because something has a certain resolution does not mean that it has the same accuracy. When SQL 2005 shipped, SQL Trace was able to return durations in microseconds. But if you run a trace today in SQL 2008 or SQL 2005 SP3, you will only see durations in whole milliseconds. The reason Microsoft changed this is that QueryPerformanceCounter relies on the CPU frequency being constant, but that is not always the case in a modern computer, because the BIOS often lower the frequency to preserve energy. For a longer discussion on this topic, read these blog entries from Bob Dorr, an Escalation Engineer for SQL Server, available through this link and this other link.

While it seems from Bob Dorr's posts, that we should be able to rely on a one-millisecond resolution, I have my doubts. I ran the test suite several times, mainly because I found new twists, or identified an error somewhere. And sometimes when I had run the test suite, I saw timings of -1 ms. Why I don't know, but I assume it is related to these issues. Presumably the start value was read on one CPU, and the stop value on the other. Thankfully, the results I publish do not include any such anomaly. (Interestingly enough, many full runs of the test suite included no such values, and then I could run one where 85 out of 100 executions for a procedure clocked in at -1. I never saw -2 or even worse, though.)

So even if the resolution nominally is one millisecond in this data, which is better than I was able to use when I ran my tests on SQL 2000 and SQL 2005, there is all reason to be careful with the exact value.

Single-threaded Results

The tests produced far too many numbers to show here. You find the results from the single-threaded tests in the two Excel files single-summary.xls and single-summary-2010.xls. The first file holds the original tests from 2009, where as the second file is the rerun in 2010. It's not that you absolutely must download and open them – but I will more or less assume that you do. In difference to the articles for the tests on SQL 2000 and SQL 2005, I am not including any numbers in the article itself. I will mainly refer to the data from the first file.

I start with giving an overview of the Excel files, and then I go on and discuss various interesting facts from the results.

Overview of single-summary.xls

The Excel files have a number of tabs. In most tabs there is one column per data type, operation and list length. The method names are coloured by category as a reminder:

Blue – Direct SQL.
Dark red – T-SQL Inline function.
Green – Opaque inline.
Black – Intermediate storage.

See the section Inline, Multi-Statement and Temp Tables in the main article for explanation of these terms.

These are the tabs:

Timings – The median execution time in milliseconds for the test.

Rankings – Ranking within the column. To reduce the impact of random flutter, the ranking ignores the last digit, so102 and 109 rank equal. (But 109 and 110 ranks differently.) This has the effect that for shorter lengths about all methods tie for the first place.

Ratios to top – The ratio between the execution time for this method and the best method in this column. When the winner has a median of 0, this counts as 1 here. (When computing the ratios, I've used all digits, so 109 and 102 gives a ratio of 1.07.)

Ratios to prev – The ratio between the execution time at this list length and the execution time at the previous length for the same method, operation and data type. All methods have a blank for 20 elements obviously. Methods with a median of 0 for 20 elements, also has a blank for 200 elements. At the bottom, you can see the ratios between the lengths itself. If the ratio exceeds the length ratio, this is bad. The numbers are in red if the ratio exceeds the length ratio with a factor of 1.2.

Compare to JOIN – This tab compares the other operations to the JOIN operation. More about this tab later.

Parallel Plans – This tab details the combinations where I got plans with parallelism in the tests. This tab is only present in single-summary.xls, but there is one column for the tests in 2009 and one for the tests in 2010. If it says All, the plan was parallel for all lists lengths for which the test was run. It the space is blank, there was no parallelism that year. Else the specific lengths for which there were parallel plans are listed.

Comparisons – This tab has some special comparisons between selected methods, that I will return to.

Comparison 2009-2010 – This tab is present only in the file from 2010. I cover it in the section Comparing the Original Test with the Rerun.

No of runs – How many times the method was run for this data type and operation. 100 in most cases, but lower if the test procedure exceeded the permitted execution time. (See the description of the test script for details.)

Query Plans

Every now and then I will refer to the query plans and try to describe them as well as I can. If you are really geeky, you can download the plans from queryplans.zip. This file includes a BCP file and a format file. You need this table:

CREATE TABLE queryplans (testsuite      varchar(8)    NOT NULL,
                         method         varchar(20)   NOT NULL,
                         datatype       char(3)       NOT NULL
                            CHECK (datatype IN ('Str', 'Int')),
                         optype         char(6)       NOT NULL
                            CHECK (optype IN ('UNPACK', 'EXISTS', 'JOIN', 'COUNT')),
                         listlen        int           NOT NULL,
                         estrows        bigint        NOT NULL,
                         isparallelplan varchar(1)    NOT NULL,
                         query_plan     xml           NOT NULL,
                         PRIMARY KEY (testsuite, method, datatype, optype, listlen)
)

To load the file, run:

BCP yourdb..queryplans in queryplans.bcp -f queryplans.fmt -T

(Add -S for server and change -T to -U and -P for username and password as needed.) To view the plans for a certain group of tests, run for instance:

SELECT * FROM queryplans
WHERE method = 'CLR$SPLIT' AND datatype = 'Int' AND optype = 'JOIN'
ORDER BY listlen

The column testsuite can have two values Test2009 and Test2010 and thus refers to which of the test runs the plan comes from. The column estrows, is the number of rows SQL Server thinks the query will return; the value is always 1 for COUNT. The column isparallelplan has an X if the plan has parallelism, else it's blank.

If you have Mgmt Studio 2008, just double-click the XML document and SSMS will recognize that this is an execution plan and display a graphical plan. If you have SSMS 2005, you need to save the XML with the .sqlplan extension and reopen it. The plans are actual execution plans and include actual number of rows and executions as well as estimated values. (Reportedly this is broken in some versions of SSMS 2008 R2.)

I had to exclude the plans for the MANYPARAM method from the zip file – these plans are huge. When I first compiled the zip file, it was almost 50 MB, but when I excluded MANYPARAM, the size went below 2 MB. Note also that the plan for (CLR$ADAM, Int, COUNT, 20) is the plan for something else due to a bug in my test script.

Table-Valued Parameters

The most interesting method to check out is of course that new kid on the block: table-valued parameters. In the main articles, I tout this as the best method. So how does it fare?

First, only so know you what I'm talking about, these are the table types used by TVP$NOPK:

CREATE TYPE intlist_tbltype AS TABLE (n int NOT NULL)
CREATE TYPE stringlist_tbltype AS TABLE (str nvarchar(50) NOT NULL)

And TVP$PK uses these:

CREATE TYPE intlist_pktype AS TABLE (n int NOT NULL PRIMARY KEY)
CREATE TYPE stringlist_pktype AS TABLE (str nvarchar(50) NOT NULL PRIMARY KEY)

We turn to the Rankings sheet and we can see that TVP ranks very high. The two methods often hold positions 1 and 2, but there are a few notable exceptions. The most striking is that TVP$PK for the JOIN, COUNT and EXISTS for integer data falls through the field for 10 000 elements, only to return to the top for 50 000 elements. If you flip over to Timings or Ratios to prev, you can also see that there is a increase in execution time that by far exceeds the increase of number of elements. For JOIN, the execution time rises from 9 ms at 2 000 elements, to 264 ms at 10 000 elements, and then there is a modest increase to 339 ms for 50 000 elements.

What is going on? The answer lies in the query plan. For most methods, the query plan will be the same no matter the number of elements, because the optimizer does not understand to use the length of an input string for its considerations. But this is different for table-valued parameters, where the optimizer can sniff the number of elements. Usually an advantage, but this time it backfires. For TVP$NOPK, as well as most other methods, the optimizer settles for scanning the input source, and then performs a loop join against usrdictwords. But for TVP$PK, the optimizer decides to perform a hash join and scans both tables. And for 10 000 elements, this is an incorrect decision. But the more elements we send in, the more lookups the loop join needs to perform. Consider the extreme situation where we have as many elements as there are rows in the source table: the loop join will be many times slower than a table scan with a hash join.

At which point the loop join and hash join have equal performance I don't know, but it seems to be above 50 000 elements. Because at 50 000, TVP$NOPK still goes for the loop join and is significantly faster than TVP$PK. But the difference is smaller, and this permits TVP$PK to catch up and overtake most other methods. For 10 000 elements, TVP$PK is 5.7 slower than TVP$NOPK for JOIN, but at 50 000 elements this ratio is only 1.6.

Now, why the optimizer makes different decisions depending on the table parameter has a PK or not, I don't know. Maybe the optimizer thinks that since the values are known to be unique, we will hit more distinct pages in usrdictwords, and therefore a loop join will be costlier. And this could make sense, if none of the table had been in cache. (The way the test is setup, the table should be entirely in cache.) I should add that for strings, the optimizer makes a similar choice, but only at 50 000 elements.

The conclusion of this is not that having a PK for your TVP is a bad idea, this is just one of these oddities that can happen with a cost-based optimizer that makes estimates from statistics. If we dig a little more into the data, we can find a case where the primary key definitely helps. If you look at the execution times for the EXISTS operations, you see that TVP$NOPK falls a little behind from the absolute top. It ranks only #5 for 10 000 integer elements and #6 for 10 000 string elements. You can also see that it is noticeably slower than TVP$PK at 50 000 elements. The reason for this is in the query plan: The plan for TVP$NOPK for all EXISTS queries include a hash aggregate to weed out potential duplicates. This operator is not needed for TVP$PK, since here the optimizer knows that there are no duplicates, and TVP$PK gets the same query plan for JOIN and EXISTS. (Then again, as we shall see when we look at call overhead, the TVP$PK has this cost when it's populated.)

If you look in the Timings sheet, there are a few more differences between TVP$NOPK and TVP$PK. For COUNT on integer, TVP$NOPK is a lot faster at 50 000 elements, and there is a similar difference in the reverse direction for the COUNT of strings at 10 000 elements. These differences are due to that one of them gets a parallel plan, the other not. There are a few more cases where parallel plans are in sway. It seems that if I were to run these tests on a multi-core box, there would be more dramatic differences.

A final note: the misjudgement to use a hash join for 10 000 integer elements was not a complete surprise to me. I observed this in the tests in 2006 as well, when this happened to methods using temp tables as I discuss in the sections Inline, Multi-statement and Temp Tables and Dynamic SQL in the other performance appendix. In the 2009 tests, this is represented by the methods MANYSELECT and EXEC$A.

CLR

Next we look at the CLR, and the best sheet to study these methods is the Timings sheet. If you look for differences between the four CLR methods, you find that they are like Siamese quadruplets. Only for 50 000 string elements, the differences are big and consistent enough that we can place CLR$ITER and CLR$SPLIT in one group ahead of the other two. But the difference is still very small: 30 ms for a total execution time of 500 ms or more.

XML

Differences Between 2009 and 2010

I made a couple of changes of the XML methods from the tests in 2009 to 2010. The tests in 2010 align with what I say in the section on XML in the main article. These things were different in 2009:

For XMLATTR and XMLATTRPOS, I did not address the attributes as @item[1], but only as @item.
In 2009, I did not run with trace flag 4130 in force. This flag have a significant effect for XMLELEM.
In 2009, as well as the original tests in 2006, the input for XMLELEM incorrectly had three levels and not two. That is, instead of:
```
<Root><Num>123</Num><Num>456</Num></Root>
```
I had:
```
<Root><Num><num>123</num></Num><Num><num>456</num></Num></Root>
```
This error was a holdover from the test script for the SQL 2000 scripts. Due to how '.' works as I explain in the main article, the query still returned the correct data. When I started to test typed XML, the error was revealed.
For the tests in 2010, I added the methods XMLATTR$SCH, XMLELEM$SCH and XMLELEM$TEXT to test the impact of using an XML schema, and to test the better way to shred element-centric XML, using the text() node test.

Getting the List Position with XMLATTRPOS

If you look in the Timings sheet, you can see that XMLATTRPOS is always slower than XMLATTR, which is not surprising, since it is a more inefficient method. But you can also see it never takes twice the time of XMLATTR (save in once case where the absolute difference is only 3 ms). And given that it "loops" over the XML document to extract one element at a time, this is astonishingly good. You can also note that XMLATTRPOS performs clearly better than the iterative method, and it plays an even game with the CTE-based methods.

In the sheet Comparisons (and this is still in single-summary.xls), I have computed the ratios for the other XML methods, the CTE methods, the iterative methods and MANYSELECT with regards to XMLATTRPOS. Numbers in blue highlight where the other method is faster than XMLATTRPOS. (Ignore the lower half of this sheet for now.) You can see that XMLATTR only has blue numbers, and so has XMLELEM for JOIN and COUNT, but apart from that, only CTE$IL has more than one blue cell.

Element-Centric XML

As long as you only look in single-summary.xls, that is, the data from 2009, XMLELEM does not fare well. For UNPACK and EXISTS, the execution times are absurd: for instance, 191 seconds to unpack 10 000 elements. As you may guess, there are query-plan disasters. The plans for UNPACK and EXISTS include an Eager Spool operator which is missing from the plans for JOIN and COUNT. A eager-spool operator caches data in an internal temp table, which is good if the data can be reused, but this does not happen here, and all the operator contributes with is a massive overhead. If you look in the Ratios to prev sheet, you can see that for these bad cases, the execution times for XMLELEM grow far more than the number of elements does. (You may note that in this misery, EXISTS is still only half as slow as UNPACK; this is due to that EXISTS benefits from a parallel plan.)

All this is different in single-summary-2010.xls. This time I ran with the trace flag that activates the fix for these poor query plans. Nevertheless, if you look in Timings (in single-summary-2010.xls), you see that XMLELEM is always clearly behind XMLATTR, save a few cases at short list lenghts. In the sheet Comparisons of this Excel-book, I have compared all the XML methods using XMLATTR as the base. Numbers are in blue when a method is faster than XMLATTR. For XMLELEM the ratio varies between 1.13 and 1.32 (if we ignore the outliers at 200 list elements.)

If you instead look at the results for XMLELEM$TEXT, you can see that this method in most cases is slightly faster then XMLATTR. And while this method is more bulky to code, this is the method you should use when you work with element-centric XML. Had I known about this in 2006 when I first ran my tests with the xml data type, I would have implemented XMLELEM with the text() node test and it would have been the end of story.

Typed XML

Finally, let's look at the methods using an XML schema. Let's stay in Comparisons and first look at the results for integer data. We can see that both XMLATTR$SCH and XMLELEM$SCH fall behind XMLATTR considerably, over 1.8 times slower for the COUNT method. Below the rows with the ratios, you can see the absolute difference to XMLATTR, and for XMLATTR$SCH this is largely constant, that is, independent of the type of operation. This indicates that there is a fixed overhead related to the schema whatever that would be. For element-centric XML it looks a little different; I'll return to that in a second.

Over to the string section, and XMLATTR$SCH is still behind XMLATTR, but at 50 000 elements, the absolute difference is now only 40-50 ms, rather than 550-600 ms. If there is an overhead for typed XML, why is this so much higher for integer values than for strings? At first, my speculation was that SQL Server gets the idea to validate the XML document at run-time, although the document was validated already when it was assigned to the typed variable. Alejandro Mesa pointed out an issue with the query plan. While the query plans for typed and untyped attribute-centric XML have identical shapes, there is a Stream Aggregate operator which in the integer case for typed XML includes a very complex expression. Maybe it is only an issue of an inefficient implementation? Naïvely, I would rather expect typed XML to be faster, since the integer value could be stored as such, and there would be no need for conversion at run-time.

If we switch to XMLELEM$SCH there are some good news. To wit, for strings it's fast! For the raw unpack of strings, XMLELEM$SCH needs only 70 % of the execution time for XMLATTR, and if you switch over to Rankings, you can see that it sneaks in 7^th place, beaten only by the TVP and the CLR methods. For JOIN and COUNT it makes #8. The reason for this good showing is found in the query plan, which is radically different from what I get with any other XML method: there is a Merge Join between the two XML readers. This merge join happens with integer data as well, but as with attribute-centric XML, there is a higher cost for the schema when there are integers involved. You can also tell from the absolute difference that the gains from the Merge Join are not equally good when there is a table involved.

Unfortunately, there is a problem. Switch to the Timings sheet and look at the numbers for the EXISTS operator, and you can see that we have a total disaster, not the least for integers where XMLELEM$SCH is slower than any other method. The basic structure of the query plans is the same for integer and strings. The test table is scanned once, and then there is a Nested Loops operator to a Lazy Spool which is fed by the Merge Join. In plain English we can describe the plan in this way: SQL Server traverses the table, and for each row it looks if the value in the table appears in the XML document. A really silly plan if you ask me.

So all and all, while XMLELEM$SCH gives good performance in some cases, it's overall behaviour makes it nothing to recommend. And, as we shall see when we look at call overhead, it's good showing for string data is a mirage.

Looking at fn_nums

Another new set of contenders in this test was the functions based on Itzik Ben-Gan's fn_nums. As with XML, there was a difference between the tests in 2009 and 2010. In 2009, the final SELECT in the UDF read:

SELECT n AS Number FROM Nums WHERE n <= @n;

whereas in the test 2010, I utilised this improvement suggested by Itzik:

 SELECT TOP (@n) n AS Number FROM Nums WHERE n <= @n;

(Actually, I think his suggestion did not include the WHERE clause, but rather he had ORDER BY. He made his suggestion shortly after I had published this article, and I ran some tests with both variations. Later when I re-ran the tests for XML in 2010, I had forgotten about this, and it was not until I had completed the 2010 tests, that I found that I had used a different version of fn_nums. Maybe the performance is better without the WHERE clause, but I decided not to make yet another re-run. The method is not really interesting enough to warrant the effort.)

The best way to evaluate them is to head to the Comparisons sheet, where the lower part compares each method to the corresponding method that uses the Numbers table. I've highlighted in red when the ratio is > 1.2, that is, when the execution time for the fn_nums method exceeds the other method with more than 20 %. And I've highlighted in green when the ratio is < 1, and the fn_nums method thus is faster. Left in black are those that can count as "equal". This part appears in both single-summary.xls and single-summary-2010.xls and the best is to arrange the Excel, so that you can look at Comparison in both files simultaneously.

It does not take very long to see that in the tests from 2009 there is a lot more red than green, and the green spots appear mainly for shorter lengths up to 2 000 elements. The best green value is 0.5 which is for an absolute difference of 1 ms. On the other hand there is no lack of red values > 2. Particularly this is noticeable for FNNUM$CHUNK, although it evens out as the list length grows.

In the results from 2010, the are fewer red spots, and red numbers are smaller. Beside the query-plan disasters (see below), there is only an occasional case where using fn_nums is as much as three times slower than using Numbers. However, the amount of green is about the same, and while using TOP improved things with fn_nums, your odds are still better with Numbers.

This is even more emphasised if you take what in regard what happens for the JOIN and COUNT operations on strings for FNFIX$SINGLE and to which FNNUM$IL in 2010 was a victim as well. The execution for a list with only 20 items is over 40 seconds – even over 60 with FNNUM$IL. Obviously something goes very wrong with the query plan here. For FNFIX$SINGLE, the query plan starts with a number of loop joins with a warning for missing join predicate; but so does all fn_nums plans. Then comes nested loops joins against usrdictwords. But instead of an index seek against the non-clustered index, there is a Lazy Spool operator and scan of the table. The actual plan tells me that the scan is only performed once, but I find this little difficult to believe. It should not take 40 seconds to scan that table. And the execution time should not increase considerably when we go up to 200 elements, but it does. Supposedly spool operator caches the entire table, and then the spool operator is scanned once for each string extracted from the input.

JOIN vs. EXISTS and the others

New for this tests were the EXISTS and COUNT operations. It can be interesting to compare these against JOIN, and I've added a comparison of UNPACK vs. JOIN for completeness. The comments below refer mainly to the tests from 2009, and the data in single-summary.xls.

EXISTS and inline T-SQL functions

Since input lists are unique, JOIN and EXISTS are logically the same. However, SQL Server does not know about this, but in one single case: TVP$PK. For all other methods, SQL Server needs to ensure that a row in the target table does not appear more than once in the result set, even if it would happen to match multiple entries in the input list. A simple way to achieve this is to take the same plan as for the JOIN query, and insert an operator to weed out the duplicates. This operator can be applied on the input list, before the operator that joins the list to the table, or it can be applied to the final result set. But as we shall see, SQL Server does not always produce this sort of plan.

If you look in Timings in the columns for EXISTS, you can directly see that there are a number of methods for which the execution time is excessive. And if you look in the name column, you see that all these names are in dark red. (Almost. The tests in 2010 added XMLELEM$SCH to the club.) If you look a little closer, you see that there is no method name in dark red for which the execution time for EXISTS is normal. Or in other words: for all methods based on an inline function, the optimizer goes seriously wrong. The plans it arrives at, are nowhere close to what I outlined above. I've looked at some of them, and there is a common pattern: there is an index scan of usrdictwords as the outer operator in a loop join and for each row in the table, there is a table spool operator, that may hold the values from the input string. As with fn_nums, I am puzzled. A scan of the table should not take that long time, and furthermore if this was all there is to it, there should hardly be any difference in execution time between 20 and 200 elements, but in most cases the execution time rises almost with a factor of 10.

Most of these plans include parallelism, but it does not help to turn it off. All that changes is that the parallelism operators go away; else the plans are the same. And performance is not any better.

As for why the optimizer produces these plans, I don't know. I thought that the poor estimates could have something to do with it, so I reduced Numbers to 256 numbers and I modified fn_nums similarly. 256 is just above the standard assumption for XML (200 rows) and far lower the assumption for a CLR function (1 000 rows). But, no, still the same very poor plans.

EXISTS and the rest

For remaining comparisons, turn to the tab Compare to JOIN where you find the difference in ms between JOIN on the one hand and UNPACK, COUNT and EXISTS on the other. A positive number means that the operation is faster than the join, a negative that it is slower.

Above I said that the plan for EXISTS could be the same plan as for JOIN, but with one extra operator to weed out the duplicates. From this follows that we can expect EXISTS to perform slower than JOIN. That operator can be either a Distinct Sort or a Hash Aggregate. If you look at the numbers, you see that there mainly is a real cost only for the longest list length of 50 000 list elements.

Interesting enough, there are a few cases where EXISTS is faster: CLR, MANYSELECT, XMLATTR and TVP$PK. This can be explained by other plan differences for the first two. The CLR plans for JOIN includes a Table Spool operator which is missing from the EXISTS plans. MANYSELECT has a parallel plan for EXISTS over strings, but not for JOIN. For integers, it does not need any distinct operator, as it uses a hash join. For XMLATTR and TVP$PK there are no plan differences, but keep in mind that compared to the total execution times, these differences are small overall.

If you look at the numbers in Compare to JOIN, you can see that TVP$NOPK has a larger cost for EXISTS than any other method. You can also see that for the CLR, EXISTS is faster over strings in all cases, but for integers it is slower for the longest input lists. These differences depend on whether SQL Server uses sorting or hashing to deal with the duplicates. Sorting is the most common choice, but for TVP$NOPK and CLR over integers, SQL Server opts to use hashing, and the for 50 000 elements the latter is slower.

This was certainly a surprise to me, but I think I found the answer in the default trace. (A trace which starts when SQL Server starts, unless you configure it not to.) I found plenty of Hash Warning and Sort Warning events in the trace. Since my timings table includes a starttime column as well as the duration, I could correlate the events with the test procedure that was running at the time. All EXISTS queries that use any sort or hash operator to dedup generated the corresponding warning at 10 000 and 50 000 list elements. At 50 000 elements, there was a hash-recursion level of 3. All queries with sorting had to do multiple passes at 50 000 elements. (For XMLATTR this also happened for strings at 10 000 elements.)

Apparently hashing suffers more from spilling to disk than sorting. But it's important to note that whether spilling occurs depends on the amount of available memory. My laptop has 4 GB of memory and many production machines top that. On a more able machine, you may see much smaller differences between JOIN and EXISTS. Then again, on 32-bit SQL Server it does not help how much memory you cram in, as on 32-bit SQL Server sorting and hashing cannot used the extended memory space beyond the plain 2 GB.

COUNT

There is reason to believe that COUNT should always be faster than JOIN, because it is the same query, with the difference that we aggregate the result into a variable instead of inserting rows into a temp table, which of course yields some overhead.

And indeed there are no negative numbers for COUNT (save one value for FNFIX$SINGLE where the total execution time is over five minutes, and thus is completely uninteresting). There is a great deal of variation from method to method, and I will only make a few observations.

The methods that build on a multi-statement function, that is all with the name in black but MANYSELECT, have very similar numbers, and this is to be expected. The method to unpack the list is hidden into the function, and other mechanisms determine the difference between inserting the result from the JOIN query into a temp table, and running the COUNT query.

For some methods, parallelism plays a part, and it cuts both ways. FIX$BINARY, FIX$SINGLE, FIX$MULTI, FIX$MULTI2 and TBLNUM$IL all have a parallel plan for COUNT but not for JOIN; this applies to both data types. A few more methods have a parallel plan for COUNT but not for JOIN for one of the data types. For TBLNUM$IL parallelism gives a tremendous boost, and the same is true for FNNUM$IL over strings. The simple fixed-length methods also seem to benefit from parallelism. But for FIX$MULTI and FIX$MULTI2, the numbers for COUNT are lower than for most other methods. (Please disregard FIX$MULTI2 over 50 000 strings; the total execution time is over ten minutes which is just bizarre.)

For TVP methods, in most cases, the numbers for COUNT are lower than for the multi-statement functions. There is a mix of parallel and non-parallel plans, but there is no obvious pattern. Maybe there simply is a difference in how fast different methods can populate the temp table. Indeed, methods that involve a multi-statement function first need to fill the return table, whereas a query using a TVP can start to fill the temp table faster, and thus gets a lower overhead. But I am only speculating here.

For the CLR, there is a Table Spool operator in the plan for JOIN for both data types that is not present for the COUNT plans. However, the spool operator sits in different places. For strings, it comes between the index seek in the non-clustered index and the key lookup in the clustered index, whereas for integer it is just before the INSERT operator. Really what effect it has, and why it is needed, I don't know. But compared to the multi-statement functions, the CLR methods suffer less from the temp table for the integer queries but take a higher toll for the string queries.

UNPACK

Obviously, UNPACK should be faster than JOIN, which it also is, with a few exceptions. One is XMLELEM in the tests from 2009. The other is the CLR methods in the tests from 2010, more about that later. As with COUNT, there are some differences between the methods. UNPACK as such is rarely of much interest, but there are a few observations that can help us to understand how the joins work.

To get a frame of reference, we can start with looking at the methods based on multi-statement functions. As with COUNT and EXISTS, they have very similar numbers, as all that affects is the cost for the JOIN operation.

In the column for 10 000 integers, two methods stand out with high values, MANYSELECT and TVP$PK. But that is because of something we know already: the join for these is done with a hash join, which is a poor choice.

More interesting are the methods for which the numbers are conspicuously low: CTE$IL, the inline functions based on fn_nums, CLR and the TVP methods. When the numbers are low, the join has the same basic query plan as the methods building on multi-statement functions, and there is no parallelism in the game. Yet they have a far lower "overhead" for the join. Apparently these methods can deliver data into the join faster. And if you think of it: CTE$IL and the functions based in fn_nums produce the data elements without touching any sort of table. And likewise does the CLR stream data into the query.

There are some exceptions to this pattern, but for FNFIX$SINGLE over strings and some of the TVP numbers, the explanation is simple: the join plan is different from the "standard" plan. What is a little more mysterious is the CLR. Here the low numbers appear only for the integer queries. For the string queries, CLR has a higher "overhead" for the join than the multi-statement function. As I've already mentioned, the join plans for the CLR include a table spool operator, and so do the UNPACK plans. But as I mentioned, this spool operator sits in different places in the plan, and presumably this affects how quickly CLR can feed the query with data.

The Bottom Line of all this?

So how interesting is this comparison of operators? Well, there are two bottom lines:

Be very careful with using EXISTS with inline T-SQL functions.
You need to benchmark in your own environment with your own queries to see what performance you actually get. The observations I have made here, mainly apply to my test and you cannot rely on them being generally applicable.

The MAX Threshold

In the sheet Ratios to prev, you can see the ratio between the execution time when moving from one list length to another. At the bottom of the sheet, below the coloured areas, you can see the actual ratios between the lengths. I've highlighted in red, if a time ratio exceeds the length ratio with more than 20 %.

In some cases, the high ratios are due to query-plan changes from list length to another. This applies to TVP$PK, MANYSELECT and EXEC$A. A few of the high ratios for short list lengths are probably due to the low resolution of 1 ms.

But beyond that you can see that for integer data, there are a couple of methods where the ratio for 50 000 is far above 5.0, and this happens consistently for all operations. If you look at string data, the pattern is a little a different. Here the jump occurs at 10 000 elements for some methods, and at 50 000 for some other. And if you study this closer, you see that the jump at 10 000 string elements happens for methods that use fixed-length input.

This is nothing new. This is something I also discuss in the older performance appendix, in the section Fixed-Length and the String-Length Threshold. A short recap: when the length of the value of any of the MAX data types exceeds a certain threshold, this leads to a jump in execution time. (Possibly because SQL Server starts spilling the variable to disk, or at least enables it to do so.) This threshold is around 500 000 bytes in 64-bit SQL Server. (And 750 000 bytes in 32-bit.). The results of the tests in 2009 agree with the predictions I made in the older performance appendix.

That is, for 50 000 elements all input lists exceed the limit, save for the methods that work on binary input. But this effect only strikes at pure T-SQL methods; CLR and XML are unaffected. And so are the TVP methods, since they don't work with any string at all. Furthermore, methods that apply chunking are largely unaffected, because they don't perform very many operations on the MAX string. This leaves us with the following methods being victims: CTE$IL and CTE$MSTMT, all fixed-length methods but the binary methods, FNNUM$IL, FNNUM$MSTMT, ITER$SIMPLE, TBLNUM$IL and TBLNUM$MSTMT. The one method that I cannot determine whether it is affected or not is MANYSELECT, since this method is also affected by plan changes.

It is also striking that some methods takes a larger toll of this effect than others. For ITER$SIMPLE the factor is 7-7.5, while for TBLNUM$IL it ranges from 23 to 33, with the net effect that TBLNUM$IL is slower than ITER$SIMPLE for 50 000 elements, and ITER$SIMPLE is not known for its efficiency.

There is one more observation in this area and that is the total breakdown for FIX$MULTI2 at 50 000 string elements. I have not investigated very closely why this happens, but so much is clear that it is not a query-plan change; the plan is the same as for other list lengths. I assume that SQL Server hits some resource limit causing it to read something to or from disk over and over again. In such case, the performance should be better if there is more memory available.

Comparing the Original Test and the Rerun

In single-summary-2010.xls there is a sheet Compare 2009-2010 which lists all methods the where the difference is at least 10 ms sorted by the ratio. The lower the ratio, the better the improvement in 2010 and vice versa. Please disregard the three columns to the right for a moment.

We have already covered some of the differences, and it is no surprise that the first 17 entries are for XMLELEM. At that point a couple FNFIX methods appear, and as you recall there was a difference between the implementation of fn_nums that I used in 2009 and 2010. The first "other" method to appear is XMLATTRPOS for the integer JOIN over 50 000 elements; but the improvement is only 71 ms or 5 %. (And there was a difference between 2009 and 2010 in this case as well: the addition of [1]).

At the bottom of the list we find four entries for FNNUM$IL for which the new version of fn_nums caused the query plan to go south when joining over the string column.

Right above FNNUM$IL in the list we find something strange: the entries on rows 316-325 are all UNPACK operations for CLR methods. And all entries are for the two longest list lengths 10 000 and 50 000. Furthermore, the they take a higher toll for the string operations. And to make it even more intriguing, CLR$ADAM takes a slightly less toll than the other methods. The effect of this is that for all CLR methods the UNPACK of 50 000 integers is a few milliseconds slower than the JOIN.

A few more methods are also conspicuously slower in the 2010 tests; if you filter out the CLR methods, you see a lot of rows from the iterative method with an increase in executions time up 29 % from the tests in 2009. If you also filter out the iterative method, you see quite a few EXISTS operations that took up 28 % longer time to execute in 2010.

Since this is the same hardware, and the methods are unchanged, this is a little mysterious. But one thing is different: the tests in 2010, were run on SQL 2008 SP2. Could this be the reason? As it happens, I installed new instance of SQL Server on my laptop, which I only upgraded to SP1. The reason for is that I was taken aback by the breakdown for FNNUM$IL. (I had forgotten I had changed fn_nums.) In the three columns to the right, I have included the numbers from this test. The first is the actual result, the other two columns compare the result with the result in 2009.

Also on this new instance, the UNPACK operations for the CLR ran slower than in 2009, but rather than 30-40 % slower, the increase in execution time stops at 10 %. For the iterative method and the EXISTS operation, the difference is only a few percent. So, yes, it seems that the differences between the tests in 2009 and 2010 can be accounted to SP2. Still it is difficult to understand why only UNPACK for CLR was affected, while the other operations suffered far less degradation.

Now, not all is worse with SP2. Above I said that several FNFIX methods performed better with the new version of fn_nums, but if you look at the columns from that new test for SQL 2008 SP1, you can see that in this test, these methods performed more or less the same as in 2009. So the reason that some FNNUM methods performed better in 2010 is more due to that SP2 produces a parallel plans, than the change in fn_nums per se. (However, the breakdown for FNNUM$IL is indeed due to the new version of fn_nums.)

Call Overhead

In the tests for SQL 2000 and SQL 2005, I measured execution times server-side only, feeling that client-side timing would suffer from random network disturbances. But with the advent of table-valued parameters, this position became untenable. There is all reason to ask if there is any extra overhead to pass TVPs.

Methodology

I used the same test setup to measure call overhead, as I used for the single-thread tests. In fact, the single-threaded test includes two timers: one server-side within the procedure, and one client-side around the call to the stored procedure.

When investigating how I should measure time client-side, I looked at what the .NET Framework and the Windows API have to offer. In the end, though, I went for a simple-minded approach: I used SQL Server. Before calling a test procedure I called this procedure:

CREATE PROCEDURE start_client_timer AS
   DECLARE @d2 datetime2(3) = sysdatetime()
   DECLARE @b varbinary(128) = convert(varbinary(128), @d2)
   SET CONTEXT_INFO @b

SET CONTEXT_INFO is different from all other SET commands, as the effect of it is not reverted when the scope exits. Then to retrieve the execution-time for the call, I used this procedure:

CREATE PROCEDURE get_clientms @clientms  int          OUTPUT,
                              @starttime datetime2(3) OUTPUT AS
   DECLARE @now datetime2(3) = sysdatetime()
   SELECT @starttime = convert(datetime2(3), context_info())
   SELECT @clientms = datediff(ms, @starttime, @now)

While these calls themselves add an overhead, this means that I was guaranteed that the client-side time never could be less than the time that I measured within the procedure. (With different timers on different computers, there is a risk that one of them would cross a boundary when the other doesn't, so the time measured client-side could appear as lower which of course is undesirable.)

Once equipped with tool, I computed the call overhead as the difference between the total execution time measured client-side and server-side divided by the number of calls. More precisely this overhead includes:

The time it takes for the client API to put the call including the parameters into the TDS stream.
Any network overhead. (In my tests, I had client and server on the same machine, so this overhead is reduced to the overhead of shared memory.)
The time for SQL Server to receive and validate the parameters.

However, the overhead for compiling a procedure is not included, as the first call for each list length is not included in the numbers I present.

Clientsummary.xls

You find the data for the call overhead in the Excel file clientsummary.xls. When you open it, you will see less columns than in single-summary.xls, because I've factored out the operation type. Reasonably, the call overhead does not depend on whether the operation is UNPACK, JOIN, COUNT or EXISTS.

In the Excel book, there are three sheets. For now we only look at the first sheet, Call Overhead which holds the result from the original tests in 2009. Here you see two tables, one for integer data, one for string data. For each data type, I have sorted the methods per performance on the longest list length. I've drawn lines to divide the methods into groups. The numbers you see is the call overhead per call in milliseconds. They are presented with two decimals, which may seem brave, given that the timing resolution is only one millisecond. But the results are actually quite consistent when it comes to the division into groups.

Table-valued Parameters

Let's start with TVPs, since they are the most interesting here. As you can see the call overhead for TVP is considerable. Almost two seconds for 50 000 integers. If you add that to the values on the Timings sheet in single-summary.xls, TVP ends up in a non-descript middle. Still better than MANYSELECT and CTE$IL, but overtaken by ITER$CHUNK. So this is a really big disappointment?

Before you jump to conclusion, keep in mind that test script is written in Perl, and it uses a client API that sits on top of OLE DB, and which I have authored myself. That is, a quite untypical use case. Undoubtedly, most people will use TVPs through SqlClient and ADO .NET. Therefore, I wrote a test program in C# which is based on the same idea as my regular test script, but far less sophisticated and hardwired for the TVP methods. (In fact it is so unsophisticated that I wrote two programs, one for strings and one for numbers, tvptest.cs and tvptest_str.cs.). The only operation I tested was UNPACK, but as noted above the operation type should not matter. I tested to pass both a DataTable and a List<SqlDataRecord>. I measured two aspects of the call overhead: A) only the time for the call to ExecuteNonQuery and B) adding the time it takes to load the DataTable or the List. I made sure that for each test I had a new suite of numbers or words. I only ran each particular test 10 times. The CSV_splitter class? See below under Notes added 2012.

You can view the results of these tests in the sheet Call Overhead ADO. There are some new method names here: TVP$NOPK$DT$A, TVP$PK$LIST$B and so on. The bit DT means that the TVP was passed as a DataTable, and LIST that it was passed as a List<SqlDataRecord>. "Methods" ending in $A are the measurements for the call to ExecuteNonQuery only, whereas $B also includes the time it takes to load the DataTable or the List.

As you can see, with ADO .NET the picture is completely different. In ADO .NET, the overhead for TVP$NOPK is lower than for XML and for strings the results are also better than for fixed-length input. (But keep in mind that that the numbers for XML and fixed-length are measured from Perl, so it is a bit of apples-to-oranges comparison.)

For TVP$PK the situation is a bit bleaker with as much as 350 ms in overhead for 50 000 strings. This higher overhead comes from that SQL Server needs to sort the data to get it into the primary-key index. There are some funny relations going from one length to another: 20 ms for 10 000 integers explode to 150 ms for 50 000 integers, and a similar effect occurs when going from 2 000 to 10 000 strings. As when I explored the differences between EXISTS and JOIN, I looked in the default trace, and I found Sort Warnings that I could correlate to the calls for TVP$PK. Thus, these jumps are due to the sort spilling to disk. This indicates that with a different amount of available memory, I would have seen different results on a 64-bit machine. (A side note: earlier I said that TVP$PK gave better performance with EXISTS queries than TVP$NOPK, because it does not need an operator to eliminate duplicates. A classic case of swings and roundabouts.)

If we also include the time it takes to load the DataTable or the List, the numbers rise quite considerably, but there is no significant difference between the two. However, this test is completely flawed. When I reviewed the code in 2012, I realised that I was loading the data from a DataSet, which is quite expensive to access. No one would that in a real application.

(Note: I first ran the ADO test on my workstation, only for integers. The numbers I got there was about the double than what I later received on my laptop, despite these machines being fairly similar in capacity and configuration. So take the numbers in Call Overhead ADO as a rough sketch.)

Notes added in 2012

When I originally ran these tests in 2009, I had not realised how simple it was to implement a class like the CSV_splitter. I was under the incorrect impression that I would need to implement IDataReader which is a more complex interface. Furthermore, in 2009 I was not aware of that you could specify in the SqlMetaData constructor that a column is unique and sorted. When I made these new revelations in 2012, I ran some quick tests to confirm my assumptions, but I opted not to publish the numbers for various reasons, including the fact that my hardware has changed since then.

Nevertheless, a few theoretical observations can be made. It's reasonable to assume the overhead for pass a CSV_splitter object is higher than for passing a List<SqlDataRecord> that already has been filled in, since when you pass the CSV_splitter, there is a string to be parsed. My quick tests indicates a difference of 5-7 ms on my current hardware. You could argue that my $A test is unfair in comparison to methods that take a comma-separated lists as input, as the list has already been parsed. But that observation is only relevant in real-world situations where the source of the data really comes from a comma-separated list. If the data comes from, say, a stream on a network port, and you have to build a comma-separated string to send it to such a stored procedure, it's the other methods that has an unfair advantage in the test. I need to say it again: if you need absolute top performance, you need to benchmark with your system and your presumptions. And that includes the client.

When it comes to marking the columns as unique and sorted in the List<SqlDataRecord> or the DataTable, how would that affect the call overhead? Reasonably, we could expect the numbers for the PK and NOPK methods to be about the same, but since SQL Server would need to validate the uniqueness and sort order the PK methods would have a slightly higher overhead, and my quick tests confirms this. Then again, I am not sure how relevant this is. If you have 50 000 values that you need to send to SQL Server, how likely is that you get the data from a source which is already sorted? Of course, you can sort client-side, but that sorting would be part of the call overhead. It can be worth adding that in the tests 2009, the data was in fact completely random in order, so for my quick rerun in 2012 I had to redesign that part of the test.

Passing Many Parameters

This far I have not covered the MANYPARAM method (except that I noted that its query plans are huge). If you look at the server-side results in the Timings or Rankings sheet in single-summary.xls you can see that it does very well for the list lengths where it is able to run. For the COUNT on strings it plays an even game with the TVP methods.

But all this changes when we look at the call overhead. The numbers you can see in the Call Overhead sheet in clientsummary.xls are nothing but gross. Already at 20 elements it's 12 ms, and for 2 000 elements, it's more than half a second! There is no method which has a server-side result which is that poor, so in fact that MANYPARAM is the slowest method of them all if we disregard the cases where optimizer goes astray.

To find out whether this was yet another issue with my client API, I wrote a test for MANYPARAM in ADO .NET, this time only for integers. (You find it in manyparamtest.cs.) You can see the result on the Call Overhead ADO sheet. As with TVP the performance in ADO .NET is better, but only with some 25 %. Almost 10 ms for 20 elements, and just below 400 ms for 2 000 elements.

In the ADO sheet there are two entries, MANYPARAM$A and MANYPARAM$B. For $A, the timing covers only the call to ExecuteNonQuery, for $B it also includes the time it takes to fill the parameter collection. The difference between the two is very small, so setting up all those parameters is cheap. But passing them, no, that's darn expensive.

All in all, while using this technique may be acceptable in some limited cases, it's definitely not a viable choice when you need to work with many list elements. Unwieldy code, an upper limit of 2 100 elements and poor performance; there is just nothing to defend it.

XML

For XML two things matter: the length of the document and the fact that SQL Server has to parse the XML documents to be able to store them in its internal format. And not surprisingly, XML has a higher overhead than plain strings. But how much?

Turn to the sheet Call Overhead 2010, where you find all XML methods gathered. You can see that for 50 000 elements the overhead ranges from 100 to 150 ms for untyped XML. Or compared to methods that use comma-separated strings, the overhead is a factor of three to four for integers while for strings it is at most a factor of two. A factor of four may sound a lot, but in absolute terms the difference is 70-100 ms for integers. If you look in the Timings sheet in single-summary-2010.xls, you can see that XMLATTR and XMLELEM$TEXT has a server-side execution time of 830-860 ms for unpacking 50 000 integers, and adding the client overhead does not really change their ranking.

We can also see that element-centric XML has lower overhead than attribute-centric, which is likely to be due to that the element-centric XML documents are shorter. If you look in Call Overhead, you can see that in 2009 element-centric XML had higher overhead; in those tests, element-centric XML incorrectly had one extra level of nodes, as I described above.

If the overhead for untyped XML was nothing be alarmed over, it's a different matter with the methods that employ typed XML. For integer data, their call overhead is a factor of seven compared to untyped XML. For strings, the overhead is "only" a factor of four. There is an apparent reason for this overhead: SQL Server has to go through the entire XML document and validate that it conforms to the schema. The overhead is higher for integer data, since in this case SQL Server needs to validate that all values are legal integers.

The overhead for XMLATTR$SCH and XMLELEM$SCH is so high, that if you add the overhead to the server-side timings shown in the Timings sheet, this affects their ranking significantly. As you may recall XMLELEM$SCH performed very well for string data, but if you take the call overhead in account, it performs somewhat worse than XMLATTR and XMLELEM$TEXT. Whence why I said the good performance of XMLELEM$SCH is a mirage. Using typed XML only to improve performance is a non-starter.

I should add that for XML I have not performed any tests from ADO .NET. While my API may have a higher overhead, it seems conceivable that most of the higher overhead for typed XML is due to SQL Server's validation which should be independent of the API.

Plain Strings

Left are the methods that use plain strings. That is, all methods that takes an nvarchar(MAX) or varbinary(MAX) parameter, and for which there cannot be any difference on the SQL Server side in receiving the parameter. All that should matter is the length of the parameter string – or so one would think.

Looking in the Call Overhead sheet, we see that the values fall in distinct groups, and if we look closer, we can see that within each group, the input format is the same. For integers the groups are, in order:

Binary input.
Space-separated strings.
Fixed length.
Comma-separated strings.

The first three groups are what we could expect. The shorter the string, the lesser the overhead. Keep in mind that these tests use shared memory, so with a real network link, these differences are likely to be more accentuated. But the last group is very funny. The space-separated and the comma-separated strings have exactly the same length, so there is no apparent reason why the overhead would be 12 milliseconds more for 50 000 integers only because the separator is a comma rather than a space. I don't have any answer for this, but I strongly suspect that this is something related my test setup or my Perl API, and this is not likely to appear with an ADO .NET application.

For strings, we see the following groups:

Comma-separated strings.
Fixed-length.
Dynamic SQL.

Again, there is something strange going on. The input strings for dynamic SQL are longer than the regular comma-separated strings, since each value is quoted, so far so good. But even with these extra quotes, the strings for dynamic SQL are considerably shorter than the fixed-length strings with 30 positions per element. And again, I have no explanation to offer but the assumption that it is not likely to show in other APIs.(Compilation? No, the compilation overhead for dynamic SQL is included in the server-side times.)

In any case, while these differences are funny, the impact on the total execution time is limited.

If you compare Call Overhead and Call Overhead 2010, you can see that for strings, at least, the call overhead was clearly lower in 2010. Maybe this is due to improvements in the OLE DB provider in SQL Server Native Client for Service Pack 2.

Multi-threaded Tests

It is with some doubt that I publish the results from my multi-threaded tests, because about the only safe conclusion I am able to draw is that the tests are inconclusive. So take what follows with several grains of salt. Note that I did not rerun these tests in 2010.

Test Setup

The test script, testscript.pl, is also able to run multi-threaded tests, but there are significant differences how the tests are run.

When the test script runs multi-threaded tests, the first thing it does is to select 10 001 lists each of integer and string data. The lists vary in lengths, repeating in groups of 11. Lists 1-3, 12-14, 22-24 etc have 20 elements. List 4, 15, 26 etc have five elements. Then come two lists with 50 elements, two with 100 and then one each with 200, 650, and 2 000 elements. (The reason that it is not a random distribution is simply laziness.) I chose mainly short lists, since it seems unlikely that you will have thousands of users hammering the server with lists with 10 000 elements.

The test script iterates over the numbers of threads to test for – 1, 5, 25 and 100 threads – and runs each test procedure for these numbers of threads. Before spawning the threads, the script first clears both the buffer cache and the plan cache. To factor compilation out of the test, the script runs the test procedure in question, using the very first input list, to enter a plan for the procedure into cache. This means that the plan is always created for 20 elements in the input list. As you see, the multi-threaded tests runs with the opposite presumptions than the single-threaded tests, starting with a cold cache. I opted to do this for two reasons: 1) Some methods (read dynamic SQL) may happen to squeeze data out of the cache, thereby leaving more work to the next guy. 2) With reads from disk, I hoped to see a better scaling effect of using more threads.

The scripts proceeds to take an exclusive application lock on session level whereupon it spawns the threads. The first thing for a thread is to call a version of start_client_timer that first tries to get a shared application lock on the resource locked by the main thread of the test script. Once the main thread has spawned all threads, it releases the application lock (after a short wait), and thereby starting the test itself. The main thread then waits for all subthreads to complete.

The remaining 10 000 lists are divided in segments over the subthreads. For instance, when running with five threads, thread 1 uses lists 2-2001, thread 2 uses lists 2002-4001 and so on. That is, no matter if the number of threads is 1, 5, 25 or 100, the script makes the same number of calls to the test procedure. The subthreads collect the timings from the procedure, but save this in memory only, and move directly to the next call. When a subthread has run all its calls, it returns the data to the main thread, that saves the data to the database when the test is over.

What to Measure

A challenge for me was to determine what times to use. The server-side timings were basically useless, since I had opted to use short lists. I first tried to use the average or median of the total execution time for the threads, but I faced distribution problems. Most threads had about the same execution time, but an occasional thread could complete much faster – and I mean much faster: subsecond, when most other threads ran for ten seconds or more. Presumably, this was due to some weirdness in the scheduling.

Eventually, I decided to simply compute the difference between the starting time for the first thread and the end time for the last thread. After all, in the real world what counts is wall-clock time. But this choice is not entirely without problems. While a test runs for a several minutes, all I get is one single value, and for some methods the execution time for a test proved to be quite different from one execution to another. Running the entire test suite to collect one single value per test took 18-20 hours, so it was not realistic to run many times to even out these variations.

The results I publish are the average of six full runs. This was not my original plan, but because I found minor flaws with regards to tempdb size (see below) in my test setup, I had reason to rerun anyway. In the end I was never able to resolve the tempdb issue entirely, and I decided that the impact was small enough that using an average of the six runs I had was better that just presenting the results from a single potentially flawless run.

Tested Methods and Operations

For the multi-threaded tests, I tested UNPACK, JOIN and COUNT, but I left out EXISTS as it did not seem interesting enough. I ran the test for both integers and strings.

I made some exclusions. UNPACK for XMLELEM (both data types) and JOIN and COUNT for FNFIX$SINGLE on strings ran too slow in the single-threaded tests to permit them be included in this test. I also largely excluded the TVP methods and MANYPARAM due to their high call overhead. This is a pity, as it would be interesting to see how well the TVP methods scale, but because of my Perl API, this was not possible. I did however included them in occasional runs, but I am not comparing their results with the other methods.

Hardware etc.

As with the single-threaded tests, I used the SQL Server 2008 instance I have on my laptop. The test script ran on my desktop machine, which runs Vista Ultimate x64 SP2 and has 6 GB of RAM and a 2.2 GHz dual-core CPU. (The test script consumes quite a lot of memory for 100 threads, over 2 GB, so running both the test script and SQL Server on the laptop did not seem like a good idea.)

It goes without saying that this system is less than optimal for multi-thread tests: only two CPUs, and all database files, including tempdb, on a single disk.

As I mentioned there was an issue with tempdb in the tests: I discovered very late that I had not sized tempdb correctly, and the size I used for the single-threaded tests was insufficient, so SQL Server would autogrow the data and log files for tempdb while the test was running. As autogrow could affect on the measurements, I wanted to eliminate it. But no matter how big I sized the log, autogrow still occurred. In the final run, the data file for tempdb was 1 GB and the log was 250 MB. In this run, autogrow appeared in a place where it had not appeared before. This was when I gave up, and decided to use the average of the six runs I had completed.

So, yes, the test results include distortion caused by autogrow, but I the effect seems to be minor. I was able to correlate the autogrow events in the default trace with my test results, and I could see during which procedures that had been running when autogrow took place. Their values did not stand out as particularly extreme compared to other runs where autogrow did not appear for the same procedure.

multithreadsummary.xls

You can find the results of the multi-threaded tests in the Excel file multithreadsummary.xls. This file has a number of tabs:

Totaltime – The total execution time in milliseconds for each test.

Deltatime – The difference in execution time between this number of threads and the previous number. That is, in the column 100 you see the difference in execution time between 100 and 25 threads. I have put positive numbers in red, as the general assumption is that the more threads for the same work, the shorter the execution time.

Rankings – The ranking within this combination of operation, data type and number of threads for the total execution time.

Ratio to top – How many times slower this method ran compared to the fastest method for this combination.

Threadratio – The ratio in execution time between this test and the same test with one thread. The higher the value, the better does the method scale. I have put ratios less than 1.1 in red, as this is very poor .

Ranking threadratio – Ranking of the values in the Threadratio sheet.

Distance threadratio – The difference in threadratio between this method, and the method with the highest threadratio for this combination of operation, data type and number of threads. In this sheet I have shaded the values in stronger and stronger shades of red in steps of 0.4: values up to 0.4 are unshaded, values between 0.4 and 0.8 are pink, values between 0.8 and 1.2 are more reddish pink, values between 1.2 and 1.6 are light red, and values over 1.6 are fully red. To make the JOIN column easier to read, I'm using a different colour on this sheet for JOIN. (Note: if you are viewing the Excel book with Excel 2003, you will only see three different shades: values up 0.8 appear as unshaded, and values between 0.8 and 1.6 have the same light-red colour.)

Individual runs – This sheet gives the detail data from the six runs. The column totaltime is the same value as in the Totaltime sheet, that is the average of all six runs. The columns total1-total6 is the total execution time for that particular run, and the columns ratio1-ratio6 gives the ratio between the execution time for that run and the average. I've highlighted ratios above 1.1 or below 0.9 in red.

As in single-summary.xls, the methods appear in alphabetic order with the same colour coding. The TVP methods and MANYPARAM appear "outside competition" at the bottom of the Totaltime, Deltatime and Threadratio sheets only.

General Pattern

Let's first look at the Deltatime sheet. We can see that when we go from one to five threads, the execution time falls considerably in most cases, with the exception of fixed-lengths methods for integer data with JOIN and UNPACK.

When we move to 25 threads, on the other hand, there is a rise in execution time for most methods, particularly for JOIN and UNPACK. Curiously enough, the fixed-length methods that scaled negatively for five threads, are now able to improve their execution time. For COUNT there are more methods for which the execution time continues to decrease, if not that much.

Finally, for 100 threads the vast majority of the methods have a drop in performance from 25 threads for UNPACK and JOIN. For COUNT there are still a few that scale positively.

Overall, we get best scaling with COUNT, which seems reasonable to expect. With the other operations we write to temp tables, whereas with COUNT we only read from cache. On this modest hardware, it is likely that all these processes saturate the disk.

If you move over to the sheet Distance threadratio, there is another interesting observation. You can see that in the column for JOIN on strings there are only a handful on pink cells and none that is redder, whereas other datatype/optype columns have more cells that are shaded and in different shades of red. I guess this tells us that the join operation over strings in this test does not leave much scaling potential, as, say, the COUNT operation over integer does. Not only do we have the temp tables, but string-comparison is more CPU-intensive, so that resource too is saturated.

Winners...

To evaluate the results, we need look in several places in the Excel book. The Rankings sheet gives the ranking of the overall performance, but it is in the Threadratio and Ranking threadratio sheets, we can see the actual effect of adding many parallel processes. Both are important: there are some methods that ranks good on the threadratio, but still ranks mediocre or poorly in total execution time.

The results do not lend themselves to very many conclusions: as I said, the main conclusion is that the results are inconclusive. Nevertheless, it is fairly easy to identify a winner, and that is the CLR family. If you look in Rankings, you see that no CLR method ranks lower than sixth place for five or more threads. And for 25 and 100 threads, CLR holds places 1-4 for all tests. Likewise, if you look in Ranking threadratio, you can see a similar pattern. For integer data, other methods manage to sneak into the top 4 in three cases only. For string data the picture is more mixed, particularly for the JOIN operation where the winners in all three cases are elsewhere. But overall, the CLR scales very well.

Can we also draw any conclusion which of the CLR methods that is the best? It is difficult not to notice that CLR$ADAM tops all tests for integer data in Ranking threadratio. And if you look a little closer, you see that for string data CLR$FIX ranks best of the CLR methods. If you move over to the sheet Distance threadratio, you can see that in some cases they win only with a slim margin. But there are also COUNT on integers and UNPACK and COUNT for strings where CLR$ADAM and CLR$FIX make decisive wins. So do these methods scale better than the other two? No, I think this is a mirage. Move over to Totaltime, and the picture is different. The two "winners" tends to be the slowest CLR methods with a single thread, so they are just catching up. (As for why they are slower, there are hints in the previous sections in the article. In the single-threaded tests CLR$ITER and CLR$SPLIT were somewhat faster than the other two for the longest lists. Maybe, we are be able to see this small difference also for shorter lists when we make so many calls. A more important factor, though, may be the difference in call overhead. Recall that I've run CLR$ADAM with comma-separated input, which for some strange reason has a higher call overhead than space-separated. You can note that CLR$ADAM lags more behind on integer lists, and CLR$FIX on string lists.)

Thus, from these tests, you cannot say that one CLR method is better or worse than the other, only that CLR in general scales well. (But recall the experiences of SQL Server MVPs Adam Machanic and Paul Nielsen when using String.Split.)

And, please keep in mind that these tests do not cover TVPs. You can see its data in the Threadratio sheet, and it's fairly drab. But due to the high call overhead in my Perl API, we don't know what we are measuring. As the time to make the call exceeds the execution time in SQL Server, we are not stressing SQL Server like we do with the other methods, that's for sure. So maybe CLR is only the winner in the absence of TVPs.

Are there any other winners? Well, there are a few surprise showings. Look at which method that gets to #1 for the JOIN on strings with 100 threads: CTE$CHUNK, who would have thought? In some individual runs I saw it make #1 also for the JOIN on integers. Now, what is going on here I don't know, but you can see that CTE$CHUNK ranks much better for JOIN and UNPACK that include a temp table, than for COUNT that does not. The same also applies for CTE$MSTMT, the winner for the join over strings with 25 threads. Maybe the CTE methods are slow to feed the temp table, so they have a better potential for scaling. After all, if you look in Rankings, CTE$CHUNK ranks no better than #13 for the integer join, and #10 for the string join, so it's #1 showing in threadratio is probably more a piece of curio than anything else.

While you are in the Rankings sheet, look at the row for MANYSELECT and see that it is in place 24 for all tests on strings, that is last or second last. You can also see that for integers, it never ranks better than #20. Now switch to Ranking threadratio: for the join over strings: it grabs places 2 and 3 for 25 and 100 threads respectively. It's rankings for COUNT are good too. But since the total time is poor, the method is still not of interest.

While not directly a winner, let me also note that XML does fairly well. For strings, XMLATTR ranks between places 5 and 9 for total execution time and it ranks decently also on threadratio.

... and Losers

The most noticeable loser is the fixed-length method, and more precisely when you use the Numbers table. As you can see in the Threadratio sheet, there are several entries in red, meaning that the ratio with regards to a single thread is less than 1.1. There are even values below 1, meaning that it took longer time to run the test with many threads instead of one!

Certainly, the picture is not clear-cut. If you look in Ranking Threadratio, there are some tests where the fixed-length method ranks fairly decently. FIX$MSTMT even manages to rank as #1 and #2 for the JOIN and COUNT over strings for 5 threads respectively. But then as the number of threads increases, it falls to lower rankings. And you can see the same pattern for FIX$BINARY and FIX$SINGLE over integers. However, there are also some cases with the opposite pattern. FIX$MULTI and FIX$MULTI2 ties for #8 for COUNT over integers for 100 threads while being at places 19 and 21 for 5 threads, and FIX$SINGLE has a similar development for COUNT over strings: #12 for 5 threads, and then #9 for both 25 and 100 threads.

But given that the fixed-length method is one of the fastest methods in the single-threaded test, the overall result is a disappointment. If you look in the Rankings sheet, you can see some really poor rankings, not the least for the join over integers for 5 threads, where FIX$BINARY is at #17, and the others (save FIX$MSTMT) fare even poorer.

It is interesting to compare with the FNFIX methods. Looking in Rankings Threadratio, it is true that FNFIX$MSTMT ranks lower than FIX$MSTMT, but if you look at FNFIX$BINARY and FNFIX$SINGLE they rank a whole lot better than FIX$BINARY and FIX$SINGLE, except for the join over integers where the FNFIX methods also scale poorly.

Another group that does not scale very well are the FNNUM methods. In Ranking Threadratio, they never make it to top 10, but in one single case and then just barely. The TBLNUM methods have a more mixed picture, but TBLNUM$IL does really bad on the two join operations. TBLNUM$CHUNK and TBLNUM$MSTMT are unable to make of use the good scaling potential for the COUNT over integers, and for 100 threads their threadratio is more than 2.0 less than the leader.

By now, I think I've mentioned all groups of methods but the iterative method and dynamic SQL: both make a mediocre appearance. Although, I will have to admit that I thought EXEC$A would lose big time, due to all its compilation. But it only hits the bottom for COUNT over integers.

Parallelism?

A final question is: could it be that the poor showing of some methods is due to a parallel plan? Conceivably, with a parallel plan, the CPUs get saturated more easily, as all threads try to use all cores at once. I investigated this, and the answer is no. Rather it is the other way round. That is, methods that have a parallel plan for one operation tend scale better for that operation than for operations where they have a serial plan. For the record, here are all cases of parallel plans in the multi-threaded test:

Method	Data type	Operation
CTE$IL	Str	COUNT
FIX$BINARY	Int	COUNT
FIX$MULTI	Int	COUNT
FIX$MULTI	Int	UNPACK
FIX$MULTI	Str	COUNT
FIX$MULTI	Str	UNPACK
FIX$MULTI2	Int	COUNT
FIX$MULTI2	Int	UNPACK
FIX$MULTI2	Str	COUNT
FIX$MULTI2	Str	UNPACK
FIX$SINGLE	Int	COUNT
FIX$SINGLE	Str	COUNT
FNNUM$IL	Str	COUNT
TBLNUM$IL	Int	COUNT
TBLNUM$IL	Str	COUNT

Many fixed-length methods in the pack, and generally they did fairly well on COUNT, and this applies to TBLNUM$IL as well.

Conclusion

Did you really make it this far? Then I don't know who is the most geeky, you who read all this – or I who wrote it. Anyway, you have now seen the results from tests of various list-to-table methods. The single-threaded tests are tested and tried – this is the third time I run them. New for this round were observations of call overhead, and multi-threaded tests.

In the single-threaded tests you saw that TVP, CLR and fixed-length were generally the fastest. But a poor choice of the optimizer can send a fast method down the list, or a successful parallel plan can send some other method to the top. And for all methods based on T-SQL, there is a threshold around 500 000 bytes (x64) or 750 000 bytes (x86) where working with nvarchar(MAX) becomes slower. The method most prone to take hit from this is fixed-length.

But what is more important are the really catastrophic results you can get with T-SQL inline functions. For many applications the iterative method will do just fine, never mind that in these tests it ranks poorly. But few applications can accept that it takes 35 seconds to handle a list of 20 elements.

The client-side tests, showed that TVPs have some more overhead than plain strings, but if you are using ADO .NET this is nothing to be alarmed of. (But if you use Win32::SqlServer, you may have reason for concern.) The client-side tests also showed that using many parameters is dead in the water, unless you only cap it at a very small number.

The multi-thread tests showed that the CLR scales well, and the fixed-length method less so. But overall the result from this test is dubious, given the system I ran the tests on. And the fact that TVPs could not compete also makes those tests less interesting.

And it cannot be stressed enough: if you think you need top-notch performance in your system, you need to run your own benchmarks with your queries, with your hardware. And with the environment you expect. That is, if you need it for long strings, test with long strings. If you expect many multiple processes to hammer the server, test with many processes. And keep in mind that much more than the method itself, what will be decisive is the query plan you get.

Revisions

2012-07-01 – Added the subsection Notes from 2012 in the section on call overhead for TVP in conjunction with updating the main Arrays and Lists in SQL Server 2008.

2010-10-10 – I had to rerun the test suite to account for several issues with XML that I had been unaware of when I ran tests originally. See the XML section for details on the differences. The rerun also tests a different version of fn_nums.

2010-01-06 – First revision

Back to my home page.

Arrays and Lists in SQL Server Data from Performance Tests in SQL 2008

Introduction

UNPACK

JOIN

COUNT

EXISTS

Differences Between 2009 and 2010

Getting the List Position with XMLATTRPOS

Element-Centric XML

Typed XML

EXISTS and inline T-SQL functions

EXISTS and the rest

COUNT

UNPACK

The Bottom Line of all this?

Notes added in 2012

Arrays and Lists in SQL Server
Data from Performance Tests in SQL 2008