Passing array/list to a Stored Procedure in SQL Server 2005
Arrays and Lists in SQL Server 2005
This is an excellent excerpt from Erland Sommarskog
In the public forums for SQL Server, you often see people asking How do I use arrays in SQL Server? Or Why does SELECT * FROM tbl WHERE col IN (@list) not work? The short answer to the first question is that SQL Server does not have arrays – SQL Server has tables. However, you cannot specify a table as input to SQL Server from a client. What you can do is to specify a string and unpack that into a table.
This article describes a number of different ways to do this, both good and bad. I first give a background to the problem (including a quick, if the not the best, solution). I then give a brief overview over the methods, whereupon I discuss general issues that apply, no matter which method you use. Having dealt with these introductory topics, I devote the rest of the main article to detailed descriptions of all methods, and I discuss their strengths and weaknesses. To find out how well these methods perform, I have conducted performance tests, and I relate the results of these tests in a separate appendix.
If you feel deterred by the sheer length of this article, you should be relieved to know that this is the kind of article where you may come and go as you please. If you are a plain SQL programmer who want to know "how do I?", you can drop off already after the first solution if you are in a hurry. If you have a little more time, you read the background, the overview and the General Considerations section, and study the methods that look the most appealing to you. True SQL buffs who are curious about the performance numbers, might find the explanations of the methods a little tedious and may prefer to skim these parts, and then go directly to the performance-test appendix.
Note: this article covers SQL 2005 only (save a few back references to SQL 2000). If you are using SQL 2000, there is an older version of this article that covers SQL 2000, SQL 7 and SQL 6.5.
Note: all samples in this article refer to the Northwind database. This database does not ship with SQL 2005, but you can download the script to install it from Microsoft's web site.
Here is a table of contents:
Introduction
Background
Comma-separated List of Values
Inserting Many Rows
Overview of the Methods
General Considerations
Interface
Robustness
Performance Considerations
varchar vs. nvarchar
Inline, Multi-Statement and Temp Tables
A Caching Problem with SQL Inline
MAX Types vs. Regular (n)varchar
Collations
Unpacking Lists in a Table
The Iterative Method
List-of-integers
List-of-strings
Using the CLR
Introducing the CLR
CLR Functions Using Split
Rolling Our Own in the CLR
XML
Using a Table of Numbers
Fixed-Length Array Elements
Using Recursive CTEs
Dynamic SQL
Making the List into Many SELECT
Really Slow Methods
Conclusion
Acknowledgements and Feedback
Revisions
You have a number of key values, identifying a couple of rows in a table, and you want to retrieve these rows. If you are the sort of person who composes your SQL statements in client code, you might have something that looks like this:
SQL = "SELECT ProductID, ProductName FROM Northwind..Products " & _
"WHERE ProductID IN (" & List & ")"
rs = cmd.Execute(SQL)
List is here a variable which you somewhere have assigned a string value of a comma-separated list, for instance "9, 12, 27, 39".
This sort of code above is bad practice, because you should never interpolate parameter values into your query string. (Why is beyond the scope of this article, but I discuss this in detail in my article The Curse and Blessings of Dynamic SQL, particularly in the sections on SQL Injection and Caching Query Plans.)
Since this is bad practice, you want to use stored procedures. However, at first glance you don't seem to find that any apparent way of doing this. Many have tried with:
CREATE PROCEDURE get_product_names @ids varchar(50) AS
SELECT ProductID, ProductName
FROM Northwind..Products
WHERE ProductID IN (@ids)
But when they test this:
EXEC get_product_names '9, 12, 27, 37'
The reward is this error message:
Server: Msg 245, Level 16, State 1, Procedure get_product_names, Line 2
Syntax error converting the varchar value '9, 12, 27, 37' to a column
of data type int.
This fails, because we are no longer composing an SQL statement dynamically, and @ids is just one value in the IN clause. An IN clause could also read:
... WHERE col IN (@a, @b, @c)
Or more directly, consider this little script:
CREATE TABLE #csv (a varchar(20) NOT NULL)
go
INSERT #csv (a) VALUES ('9, 12, 27, 37')
INSERT #csv (a) VALUES ('something else')
SELECT a FROM #csv WHERE a IN ('9, 12, 27, 37')
The correct way of handling the situation is to use a function that unpacks the string into a table. Here is a very simple such function:
CREATE FUNCTION iter$simple_intlist_to_tbl (@list nvarchar(MAX))
RETURNS @tbl TABLE (number int NOT NULL) AS
BEGIN
DECLARE @pos int,
@nextpos int,
@valuelen int
SELECT @pos = 0, @nextpos = 1
WHILE @nextpos > 0
BEGIN
SELECT @nextpos = charindex(',', @list, @pos + 1)
SELECT @valuelen = CASE WHEN @nextpos > 0
THEN @nextpos
ELSE len(@list) + 1
END - @pos - 1
INSERT @tbl (number)
VALUES (convert(int, substring(@list, @pos + 1, @valuelen)))
SELECT @pos = @nextpos
END
RETURN
END
The function simply iterates over the string looking for commas, and extracts the values one by one. The only complexity is the logic to handle the last value in the string. Here is an example of how you could use this function:
CREATE PROCEDURE get_product_names_iter @ids varchar(50) AS
SELECT P.ProductName, P.ProductID
FROM Northwind..Products P
JOIN iter$simple_intlist_to_tbl(@ids) i ON P.ProductID = i.number
go
EXEC get_product_names_iter '9, 12, 27, 37'
So there, you have a solution to the problem. But let me say it directly that the function above is not extremely speedy, and almost all methods I will discuss in this article are faster than the one above. Nevertheless it's good enough for many situations, particularly if your list is short. So if you are in a hurry and want to move on with your project, feel free to stop here and come back later if you are curious or run into problems with performance and need to learn more.
If you are in the situation your lists are in a table column, and you want to rush, head for the section Unpacking Lists in a Table.
A related problem is you need to insert many rows. You think that one INSERT statement at a time, or calling a stored procedure for every row will be slow because of the many network roundtrips, so you would like to some more efficient method.
Of course, if you are looking into to importing data on a major scale, you should consider SQL Server Integration Services (SSIS) or to use bulk load with BCP or BULK INSERT. But sometimes SSIS or bulk load would shoot over the target and you want something more lightweight, yet more efficient than sending one row at a time.
Indeed, most of the methods that I describe in this article can be used for this purpose. Most of them are best fitted to handle "arrays" of single values, although they can be reworked to handle records with several fields. There are two methods that serve this purpose better, and that is XML and a trick with INSERT-EXEC that I discuss in the section Making the List into Many SELECT.
In this article I'm focusing on comma-separated lists, since most questions on the newsgroup are about this scenario. But I will occasionally touch the topic of inserting many rows.
As I've already hinted there are quite a few methods to unpack a list into table. Here I will just give a quick overview of the methods, before I move on to the general considerations.
- The Iterative Method. Looping through a comma-separated list, and returning the elements in a table. While being one of the slower methods, performance is still acceptable for most applications. Its main advantage is that it is easy to understand and easily adaptable to different input formats.
- Using the CLR. SQL 2005 adds the possibility to write table-valued functions (and stored procedures etc) in .Net languages such as C# and Visual Basic. This is one of the fastest methods, and if you are used to C# or VB programming, you will find that this method lends it well to extensions.
- XML. XML is the prime choice when you need to insert many rows at one time. To handle a comma-separated list it is maybe a bit of overkill. If you use the new XQuery methods added in SQL 2005, XML has good performance.
- Using a Table of Numbers to unpack a comma-separated list. A relationally "pure" solution, that is not far behind the CLR in performance.
- Fixed-length Elements. Rather than using a comma-separated list, use a string where all elements have the same length. This is the fastest method of all up to a certain limit (which is quite high). This method also uses a table of numbers.
- Using a Recursive Common Table Expression (CTE). If you want a method that does not require any extra support such as the CLR or a table of numbers, and maybe not even a function, this method is a good choice. Performance is not marvellous, but better than for the iterative method.
- Dynamic SQL. For a list of numbers, it may appear simpler than any other method, but there are several complications with regards to security. And while performance has improved a lot since SQL 2000, this method is slower than most other methods, particularly for long input.
- Making the List into Many SELECT. The list is transformed into many SELECT statements of which the result sets are inserted into a temp table. This method does not really have any advantage for handling comma-separated lists, but it is an interesting alternative when you need to insert many rows.
- Really Slow Methods. Methods that uses charindex, patindex or LIKE. These solutions are just unbelievably slow even for short input.
Most of the methods I present are packaged into functions that take an input parameter which is a list of values and returns a table, like this:
CREATE FUNCTION list_to_table (@list nvarchar(MAX))
RETURNS @tbl TABLE (number int NOT NULL) AS
The reason the methods are in functions is obvious: this permits you to easily reuse the function in many queries. Here I will discuss some considerations about the interface of such functions.
The Input Parameters
In this article the, input parameter is always of the data type nvarchar(MAX). This data type is a new data type in SQL 2005 that can fit up to 2 GB of data, just like the old ntext data type, but nvarchar(MAX) does not have the many quirks of ntext.
I made this choice, because I wanted to make the functions as generally applicable as possible. By using nvarchar the functions can handle Unicode input, and with MAX the functions permit unlimited input. With nvarchar(4000), they would silently yield incorrect results with longer input, which is very bad in my book.
Nevertheless, there is a performance cost for these choices. If you use an SQL collation, you should know that varchar gives you better performance (more on that in a minute). And some operations are slower with the MAX data types. Thus, if you know that your lists will never exceed 8000 bytes and you will only work with your ANSI code page, but you need all performance you can get, feel free to use varchar(8000) instead.
Some of the functions take a second parameter to permit you to specify the delimiter, or in case of fixed-length the element length. In some functions I have opted to hard-code the delimiter, but for all methods that takes a delimiter (save dynamic SQL), you can always add such a parameter. To keep things simple, I have consistently used one-character delimiters. If you need multi-character delimiters, most methods can be extended to handle this.
The Output Table
The output from the functions is tables. For all methods, I've included one function for returning a table of strings, and for some methods also a function for returning a table of integers. (Which is likely to be the most common data type for this kind of lists.) If you have a list of integers and a function that returns strings, you can use it like this:
SELECT ...
FROM tbl t
JOIN list_to_table(@list) l ON t.id = convert(int, t.str)
Similar applies to other data types. You can easily clone a version of the function that has the convert built-in. (However, check the Robustness section for some things to look out for.)
The data type that requires most consideration is actually strings. Return nvarchar or varchar? Obviously when you work with Unicode data, you need to get nvarchar strings back. It may be tempting to always return nvarchar, but for reasons that I will return to in the performance section, you should make sure that you have a varchar string when you join with a varchar column. For some methods, the return table includes both a varchar and an nvarchar column.
In some functions, I also return the position in the list for the list elements. This can be handy when you have two or more lists that are horizontal slices of the same source, so you can say things like:
INSERT tbl(id, col2, col3)
SELECT a.number, b.str, c.str
FROM intlist_to_table(@list1) a
JOIN charlist_to_table(@list2) b ON a.listpos = b.listpos
JOIN charlist_to_table(@list3) c ON a.listpos = c.listpos
That is, this is a way to insert many rows in one go, although it's really the best one. Sometimes this can be OK if you only have two or three columns per row to insert, but as the number of parallel lists grows, it gets out of hand, and you should investigate XML instead. The particular danger with the approach above is that if the lists get out of sync with each other, you will insert incorrect data.
For some methods, the list position can easily be derived from the method itself, for others (but not all) you can use the row_number() function, a very valuable addition to SQL 2005.
It can't be denied that parsing strings is a bit risky. As long as the input plays by the rules everything goes fine, but what happens if it doesn't? A good list-to-table function can protect you from some accidents, but not all. Here are a couple of situations to watch out for.
Delimiter in the Input
Say that you have a couple of city names like this: Berlin, Barcelona, Frankfurt (Main), Birmingham, København. From this you want to compose a comma-separated list that you pass to list-to-table function. With the names listed above, that works fine, but then some joker enters Dallas, TX. Oh-oh.
There are several ways to deal with this problem. One is to use a delimiter that is unlikely to appear in the input data, for instance a control character. Many programs put strings into quotes, so the above list would read "Berlin","Barcelona" etc. This latter format is not supported by any of the functions I present, but you could tweak some of them a bit to get there.
Sometimes you cannot really make any assumption about the delimiter at all, for instance if the input comes from user input or on the wire. In such case you will need to use a method with a general escape mechanism, of which I present one, to wit XML. Or you can avoid the delimiter business completely by using fixed-length strings.
When you work with lists of integers, this is not very likely to be a problem.
Extra Spacing
If you have an input list that goes:
ALFKI, VINET, BERGS,FRANK
Do you want those extra spaces to be included in the data returned by the list-to-table function? Probably not. All functions in this article strips trailing and leading spaces from list elements. However, there are some methods, where this is not possible. (Or more precisely, they are not able to handle inconsistent spacing.)
Illegal Input
Say that you have a function that accepts a list of integers, and the input is 9, 12, a, 23, 12. What should happen?
With no particular coding, SQL Server will give you a conversion error, and the batch will be aborted. If you prefer, you can add checks to your function so that the illegal value is ignored or replaced with NULL.
To focus on the main theme, I have not added such checks to the functions in this article.
Empty Elements
What if your function that accepts a list of integers, is fed the input: 9, 12,, 23, 12. How should that double comma be interpreted? If you just do a simple-minded convert, you will get a 0 back, which is not really good. It would be better to return NULL or just leave out the element. (Raise an error? You cannot raise errors in functions.)
One approach I have taken in some functions in this article is to avoid the problem altogether by using space as delimiter. But since T-SQL does not provide a function to collapse internal spacing, the approach is not without problems. For methods that build on logic of traditional programming, you can easily handle multiple spaces, but for methods that uses a combination of charindex and set-based logic, you would still have to filter out empty elements in the WHERE clause. (Something I have not done in this article.)
While I have conducted performance tests and devoted a long appendix to them, the important performance aspect is not with the methods themselves, but how you use them and how they are packaged. In this section I will look into some important issues.
As I discussed in the Interface section, it appears to be a good choice for a function that unpacks a list of strings to have an nvarchar column in its return table, so it can work with both Unicode and 8-bit data. Functionally, it's sound. Performancewise it can be a disaster. Say that you have:
SELECT ...
FROM tbl t
JOIN list_to_table(@list) l ON t.indexedvarcharcol = l.nvarcharcol
Why is this bad? Recall that SQL Server has a strict data-type precedence, which says that if two values of different data types meet, the one with lower precedence is converted to the higher type. varchar has lower precedence than nvarchar. (Quite naturally, since the set of possible values for a varchar column is a subset of the possible values for an nvarchar column.) Thus, in the query above indexedvarcharcol will be converted to nvarchar. The cost for this depends on the collation of the column.
If the column has a Windows collation, SQL Server will still be able to use the index, because in a Windows collation the possible the varchar values comes in the same order as the nvarchar values. SQL 2005 makes benefit of this by doing a range seek. But this range seek is more expensive than a straight seek, and my tests indicate a doubling of the execution time, which of course is reason enough to avoid it.
The real disaster is if the column has an SQL collation: in this case the index is completely useless, because in an SQL collation, varchar and nvarchar values sort differently. This means that SQL Server will have to find some other way to run the query, most likely one that requires a scan of the entire table. Instead of a sub-second response, it may take minutes to run the query.
Thus, it is instrumental that you have separate functions to return varchar and nvarchar data. Or do as I have done in some functions in this article: have two return columns, one for varchar and one for nvarchar. You can also always try to remember to convert the output column to the appropriate data type – but it's human to forget.
You can find out the collation of a column with sp_help.
There is one more thing to say on this theme, which I will come back to in a minute.
This far I have only said "table function", but there are different kinds of table functions, and you can use them in more than one way. And this can have a very significant impact on performance. Essentially there are five possibilities:
- SQL Inline. An inline table-function in T-SQL is only a function to its syntax. It is in fact a parameterised view. When a query includes an inline function, SQL Server expands the function as if it was a macro, and the optimizer works with the expanded query text. Thus, there is no overhead at all for an inline function. To this category I also count dynamic SQL, although it cannot be packaged into a function as such.
- Multi-statement Function. A multi-statement function, on the other hand, has a body that is executed on its own. A multi-statement function is computed separately and returns its result a table variable. Consequently, there is an overhead for the intermediate storage. There are no statistics associated with table variables, so the optimizer is blind to what the function returns, and it can only apply standard assumptions.
- Opaque Inline. Non-SQL methods such as the CLR or XML fall in between the two T-SQL function types. On the one hand, the optimizer have no information about what they might return, but applies standard assumptions. On the other hand, the results of the operations are streamed into the query. That is, there is no intermediate storage as for multi-statement functions, so they are still inline in that sense.
- Bounce Data over a Table Variable. Rather than joining with the function directly, you can insert the data into a table variable and then join with your target table. If you do this with an T-SQL inline function, this is very similar to rewriting that function as a multi-statement function, including the important note that in lieu of statistics the optimizer can only apply standard assumptions. (I get the impression from my test results that the overhead of a multi-statement function is higher than when you just bounce over a table variable, but I have not examined this in detail.)
- Bounce Data over a Temp Table. Instead of joining with a function directly, you can unpack your list into a temp table. It is often said that temp table incurs more overhead than a table variable, since it's fully logged, but from my tests I am not really able to confirm this statement. More importantly, temp tables have statistics, which means that the optimizer has more information and the odds for a good plan are better.
No method can be implemented in all five ways, but for most you at least have the choices of a multi-statement function or a temp table. Only some methods lend themselves to any of the two inline alternatives.
It may sound from the above that SQL Inline is the most preferable, since there is no intermediate table and the optimizer has full information. Unfortunately, the latter point is not really true. To be able to estimate the best way to access the other tables in the query, the optimizer would need to know is: 1) how many rows will the input string generate 2) the distribution of the values. But the optimizer is not able to do that with the SQL inline functions in this article, because the information is buried too deep in the logic of these functions. (There is one exception: dynamic SQL, which has its own set of problems which makes it less palatable.)
Thus, in practice the optimizer will apply blind assumptions no matter you use SQL inline, opaque inline, a multi-statement function or a table variable. So then it does not matter which one you use? Oh, no. The blind assumptions are different for the different inline methods, and if the function uses an auxiliary table of numbers, the size of that table will affect the blind assumptions. (Because the optimizer has information about that table, is able to use it.) And the blind assumptions for CLR functions and XML are different from each other and from those for T-SQL inline functions. The assumptions for multi-statement functions and table variables appears to be the same vis-à-vis each other, but yet different from the inline methods. With some luck, if the blind assumption for one method leads the optimizer astray, it may work better with another.
Overall, you could say that the T-SQL inline functions have the potential for more "interesting" query plans. Here is an extract from a mail that I received in response to my old article for SQL 2000:
After reading your article 'Arrays and list in SQL server' I tried to use the Fixed-Length Array Elements method in my application. Everything worked fine until I moved the code from a client batch to a stored procedure. When I looked at the query execution plan, I saw that the number of rows retrieved from the Numbers table was over 61 millions ! Instead of starting by joining the Numbers table with the source table to filter out the 500 rows included in the array, it processes the GROUP BY clause on the entire table (121000 rows) and then it uses a nested loop to match each entry with the Number table.
With all other methods but T-SQL inline, the optimizer does not really have much other choice than to first compute the table from the list input, but some T-SQL inline functions opens for the possibility for a reverse strategy – which is not likely to be successful. In this particular case, I suggested that he should try a multi-statement function instead, and that resolved his issue. But "interesting" does not always mean bad. Later in the text, I will discuss cases where T-SQL inline gives unexpectedly good performance on multi-CPU machines, because the optimizer finds a parallel plan.
What about temp tables then? Initially, when the optimizer compiles a stored procedure, it makes a blind assumption about a temp table. But if a sufficiently amount of data is inserted into the table this will trigger auto-statistics, and this will in its turn trigger a recompile of the statements where the temp table is referenced. This recompile is both a blessing and a curse. It's a blessing, because it gives the optimizer a second chance to find a better plan. But if the optimizer comes up with the same plan as it had before, it was just wasted cycles. In SQL 2000 where the entire procedure would always be recompiled this could be really expensive. SQL 2005 has statement recompile, so the effect may not be equally drastic.
At this point the reader may feel both confused and uncomfortable over all these complications. In practice, it is not really that bad. Often, these blind assumptions work fairly well, particularly if your input lists are small. So go for a method that you think fit your needs, and stick with it as long as you don't run into problems. When do you run into bad performance, come back and read this section again, to get an idea of what alternatives you should try. One rule of thumbs is that the bigger the input list is, the more reason you have to consider using a temp table.
Here is an overview which strategies that are possible with which methods:
| |
T-SQL Inline |
Opaque Inline |
Multi-Statement |
Table Variable |
Temp Table |
| Iterative Method |
No |
No |
Yes |
Yes |
Yes |
| CLR |
No |
Yes |
No |
Yes |
Yes |
| XML |
No |
Yes |
No |
Yes |
Yes |
| Table of Numbers |
Yes |
No |
Yes |
Yes |
Yes |
| Fixed-Length |
Yes |
No |
Yes |
Yes |
Yes |
| Recursive CTE |
Yes |
No |
Yes |
Yes |
Yes |
| List to SELECT |
No |
No |
No |
Yes |
Yes |
| Dynamic SQL |
Yes |
No |
No |
No |
No |
| Real Slow |
Yes |
No |
No |
No |
No |
Consider this procedure:
CREATE PROCEDURE test_sp @str nvarchar(MAX) AS
SELECT t.col1, t.col2
FROM testtbl t
JOIN inline_split_me(@str) c ON t.id = c.Value
As you can guess from the name, inline_split_me is an inline function. SQL Server MVP Tony Rogerson discovered that there is a problem here: the query plan for the procedure is not put into cache as it should be, which means that the procedure is compiled each time is executed. This is bad not only for the cost of the unnecessary compilations: if there are multiple simultaneous calls to this procedure, these calls will be serialised, because if one process (re)compiles a procedure, no other process can execute that procedure but will be blocked until compilation has completed.
There is no good reason for this behaviour; an SQL Server Developer has confirmed to me that this is a bug in SQL 2005 and nothing else.
This problem appears only with T-SQL inline functions, not with multi-statement functions. And it only appears if the input variable is of a MAX data type. (Or the arcane text, ntext or image.) If the input variable is a regular nvarchar(4000), the issue does not arise.
A workaround is to copy the input parameter to a local variable like this:
CREATE PROCEDURE test_sp @str nvarchar(MAX) AS
DECLARE @copy nvarchar(MAX)
SELECT @copy = @str
SELECT t.col1, t.col2
FROM testtbl t
JOIN inline_split_me(@copy) c ON t.id = c.Value
On the other hand, bouncing the data over a temp table or a table variable does not seem to help.
The problem does not appear with all inline functions in this article, only with those for table-of-numbers and recursive CTE, but not with those for fixed length. However, this may be due to pure luck, so I encourage you to examine with SQL Profiler whether your procedure is victim to this bug. Enable all events for stored procedures. If everything is alright, you should see a CacheInsert event the first time you run the procedure, and on subsequent executions, you should see a CacheHit event. If you see a CacheMiss every time but no CacheInsert, you have encountered this bug.
In a previous section I discussed the problems with joining nvarchar and varchar. When I ran my performance tests and investigated some unexpected results; I discovered a second problem of a similar kind. Consider this:
SELECT ...
FROM tbl t
JOIN list_to_table(@list) l ON t.indexednvarcharcol = l.nvarcharmaxcol
The list-to-table function is here written in such a way that its return type is nvarchar(MAX). This too leads to an implicit conversion of the indexed column. It may not be apparent that it has to be that way at first sight, but when SQL Server evaluates an expression, it always works with the same data type for all operands. And apparently, nvarchar(4000) and shorter is a different data type from nvarchar(MAX). The result of the implicit conversion is not fatal. The optimizer applies a range-seek operator and is still able to use the index, but nevertheless there is an overhead. When I initially ran my tests, I had not observed this issue, and my inline functions returned nvarchar(MAX) (of the simple reason that the input string was nvarchar(MAX)). As a consequence, my tests in some cases seemed to indicate that inline functions performed worse than the corresponding multi-statement solutions.
Presumably, most of the time when you use list-to-table functions for a list of strings, the strings are short, just a few characters long. Therefore, there is all reason to make sure that your list-to-table function returns a regular varchar or nvarchar. Particularly, this means that for inline functions, you should make sure that the return value is explicitly converted to regular (n)varchar. You will see this in all inline functions in this article.
All functions in this article uses nvarchar both for parameters, output and internal variables. If you never work with Unicode data, you may think that you should rewrite the functions to use varchar instead, assuming that 8-bit characters are faster for SQL Server to work with than the 16-bit Unicode characters.
This may or may not be the case, depending on which collation you are using. As I discussed above under varchar vs. nvarchar, there are two sorts of collations: Windows collations and SQL collations. If you use a Windows collation, you get a slight reduction in performance if you use varchar rather than nvarchar. This is due to that with a Windows collation, the Unicode rules and routines are always employed internally, so all using varchar buys you is some extra conversions.
On the other hand, with an SQL collation you can get some 30 % improvement in execution time with using varchar instead. This is because SQL collations are 8-bit only, for which there exist a separate set of 8-bit only routines, and the rules for an 8-bit character set are far simpler than those for Unicode. If you have an SQL collation and use nvarchar, you are in fact using a Windows collation under the cover.
Note here that the exact gain depends on the type of operation. 30 % is what you can expect from a plain equality test. There are situations where the difference between varchar and nvarchar in an SQL collation can be as much as a factor of 7. We will look at such case in the section on really slow methods.
But there is an option for better performance with nvarchar: use a binary collation. Now, if you opt to use a binary collation throughout your database, you will have to accept that all comparisons are case-insensitive, that sorting is funky, particularly for other languages than English. So for most applications, a binary collation is not a viable option. However, there exists a second possibility: force the collation for a certain expression. I have employed this throughout this article where it makes sense. You will see a lot of things like:
charindex(@delimiter COLLATE Slovenian_BIN2, @list, @pos + 1)
Since @delimiter is cast to an explicit collation, this also happens with @list. (This is discussed in Books Online in the topic Collation Precedence.) When using charindex to find a delimiter, odds are good that you are looking for the exact delimiter and you have no need for case- or accent-insensitive searches. Thus, using a binary collation in this situation does not lead to any loss in functionality. When I tested this for the iterative method, I got some 10 % improvement in execution time.
(Why Slovenian? And why BIN2? It's Slovenian because my test data is a list of Slovenian words for a spelling dictionary. BIN2 is a new type of binary collations in SQL 2005. I did not really grasp the difference to the old binary collations, but they appeared to be the better choice. Anyway, it should not matter much which binary collation you use.)
Most examples in this article work with a single list being passed as a parameter into a function, and this is probably the most common case. But sometimes you find yourself working with a table like this one:
| Modelid |
Colours |
| A200 |
Blue, Green, Magenta, Red |
| A220 |
Blue, Green, Magenta, Red, White |
| A230 |
Blue, Green, Magenta, Red, Cyan, White, Black |
| B130 |
Brown, Orange, Red |
| B150 |
Yellow, Brown, Orange, Red |
That is, the available colours for a model appear as a comma-separated list. Let me directly emphasise that this is an extremely poor design that violates a very basic principle of relational databases: no repeating groups. As a consequence of this, tables with this design are often very painful to work with. If you encounter this design, you should seriously consider changing the data model, so there is a sub-table with one row for each model and colour. Then again, to get there, you need to be able to split up these lists into rows.
In SQL 2000, it was not possible to call a table-valued function and pass a table column as parameter, but SQL 2005 adds the APPLY operator that permits you to d
Thursday, June 19, 2008 11:44:12 AM