Microsoft’s default collation is still “old school” (SQL_Latin1_General_CP1_CI_AS when SQL Server is installed on an EN-US OS). Most people don’t bother to change it during installation, which is a shame.
Going forward, I have been building all new servers with the more modern Windows collation: Latin1_General_CI_AS in our EN-US case.
What’s wrong with “old school”, you ask?
In my mind, and our environment, plenty. Rather than try to explain it myself, I’ll refer you to this very capable post. The most critical difference (IMHO) is here:
The SQL_Latin1_General_CP1_CI_AS collation is a SQL collation and the rules around sorting data for unicode and non-unicode data are different. The Latin1_General_CI_AS collation is a Windows collation and the rules around sorting unicode and non-unicode data are the same. A Windows collation as per this example can still use an index if comparing unicode and non-unicode data albeit with a slight performance hit. A SQL collation cannot do this [..] and comparing nvarchar data to varchar removes the ability to perform an index seek.
Yuck. Especially yuck when your environment contains code from vendors that routinely mix varchar and nvarchar variables and column definitions, because different modules were written by different development teams with different standards.
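To see the difference in action, here is a minimal sketch (the table and column names are my own invention, not taken from the post I linked). Under the SQL collation, comparing an indexed varchar column to an nvarchar value forces a scan; under the Windows collation, the optimizer can still seek, at a small cost:

```sql
-- Hypothetical tables: the same indexed varchar column, one per collation.
CREATE TABLE dbo.CustomersSql
(
    Id   int IDENTITY PRIMARY KEY,
    Name varchar(100) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL
);
CREATE INDEX IX_CustomersSql_Name ON dbo.CustomersSql (Name);

CREATE TABLE dbo.CustomersWin
(
    Id   int IDENTITY PRIMARY KEY,
    Name varchar(100) COLLATE Latin1_General_CI_AS NOT NULL
);
CREATE INDEX IX_CustomersWin_Name ON dbo.CustomersWin (Name);

DECLARE @n nvarchar(100) = N'Smith';

-- SQL collation: the varchar column is implicitly converted to nvarchar
-- under different comparison rules, so the index cannot be seeked --
-- expect an index scan in the actual execution plan.
SELECT Id FROM dbo.CustomersSql WHERE Name = @n;

-- Windows collation: unicode and non-unicode rules agree, so the
-- optimizer can still use the index (typically a range seek).
SELECT Id FROM dbo.CustomersWin WHERE Name = @n;
```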
If SQL collations are so bad, why didn’t Microsoft change the default?
It comes down to the problem of migrating existing databases from old servers to new ones. Those databases will have a different collation than their new server. If #temp tables (which take their collation from the server default) are joined to “real” tables on string columns in this situation, with no COLLATE clauses on the join conditions or (more sensibly) in the CREATE TABLE #temp definition, the joins will fail with a collation-conflict error.
It would be nice if we could trust vendor code, or even our own developers, to remember to use COLLATE DATABASE_DEFAULT clauses every time they wrote a CREATE TABLE #temptable statement.
Don’t make me laugh (or cry).
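A minimal sketch of the failure mode (database, table, and column names here are mine, not from any vendor code):

```sql
-- Assume the user database was restored with the old SQL collation,
-- while the server (and therefore tempdb) uses Latin1_General_CI_AS.
CREATE TABLE dbo.Orders
(
    OrderRef varchar(20) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL
);

-- The #temp table silently inherits the server/tempdb collation...
CREATE TABLE #staging (OrderRef varchar(20) NOT NULL);

-- ...so this join fails with:
-- "Cannot resolve the collation conflict ... in the equal to operation."
SELECT o.OrderRef
FROM dbo.Orders AS o
JOIN #staging AS s ON s.OrderRef = o.OrderRef;

-- The fix we wish vendors remembered: declare the #temp column in the
-- current database's collation.
CREATE TABLE #staging2
(
    OrderRef varchar(20) COLLATE DATABASE_DEFAULT NOT NULL
);
```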
If you’re confused by this problem, or what I just wrote to describe the problem, the same very capable post has a worked example for you. So do many, many other blogs — now that there is a “solution” for this in 2012, everybody is jumping on the bandwagon to write about it, and they all provide simple worked examples.
The base situation isn’t any different in SQL 2012 than previous versions, so two things (at least) stay the same:
- The default collation during installation is still the old-school SQL collation, to avoid people putting existing databases on new servers and suddenly having lots of errors;
- You should still (IMHO) change the server collation if you ever want to move forward and get the goodness of the new-style Windows collations.
Here’s where some things get better now
Because SQL 2012 introduces partially contained databases, you can bring your existing databases onto your new servers, with the new collations, by doing nothing more than this, after the restore in the target environment:
```sql
ALTER DATABASE [Restored_Target]
SET CONTAINMENT = PARTIAL WITH NO_WAIT;
```
That’s huge, when you consider the amount of work it takes to actually move a database’s objects thoroughly to the server’s collation. (You’ll find more-or-less thorough instructions from Microsoft for how to do this here.)
Yes, it’s a lot easier. No, it’s not a thorough replacement.
When you follow a strategy of using containment to handle mis-matched collations, consider this:
- Your moved database is still suffering from the disabilities inherent in the old-school collation, as described in that very capable post.
- Any multi-database joins you write, which reference newer databases in the newer collation belonging to the server, are still going to require COLLATE clauses on joins. You’ve just fixed #temptable problems. If you don’t change it now… when will you? Seems like the move to the new server is the best occasion you’re going to have.
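For example (the database and table names here are hypothetical), a join between a contained old-collation database and a database created in the server’s new collation still needs an explicit COLLATE clause:

```sql
-- OldVendorDb is partially contained and keeps the SQL collation;
-- NewAppDb was created in the server's Latin1_General_CI_AS collation.
SELECT v.AccountCode
FROM OldVendorDb.dbo.Accounts AS v
JOIN NewAppDb.dbo.Accounts    AS n
    ON v.AccountCode = n.AccountCode COLLATE SQL_Latin1_General_CP1_CI_AS;
-- Without the COLLATE clause, this join raises the same
-- collation-conflict error as the #temp table case.
```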
What’s the alternative?
A lot of people have written about how to move a database thoroughly into a new collation, without following the more-or-less thorough instructions from Microsoft, which require all user data to be bulk-exported and re-added to the new database copies. (A complete bulk export/import can be incredibly painful for large databases, and is difficult to explain in-house, when nobody really understands why you are bothering to go through the effort of changing collation anyway.)
When I bit the bullet and started to do this myself, I started with some instructions I found here. I gradually adapted those scripts and enlarged them to handle the cases I found:
- dropping and re-creating CHECK CONSTRAINTs (see this for a way of automating if you want)
- dropping and re-creating computed columns (I just used sys.computed_columns for this, for the databases that used them)
- scripting and/or refreshing views (you’ve got to drop and re-add the schema-bound ones; if you run into trouble with encrypted views, try sp_refreshview on those, leaving them in place)
- dropping and re-creating scalar- and table-valued functions
- dropping and re-creating statistics, where these statistics proved a final roadblock to thorough conversion
- keeping notes for each vendor’s db as I created and tested these scripts, to get the optimal order in which I should run them (do table-valued functions reference scalar-valued functions? What’s referenced in the views? etc)
- keeping notes on all integration points, so that, when all is ready after tests of the conversion, I can add temporary COLLATE clauses into multi-database joins, for any as-yet-unmigrated dbs joined to this one.
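The heart of scripts like these — once constraints, indexes, and dependent objects are out of the way — is generating one ALTER COLUMN per string column. A simplified sketch of what I mean (the target collation is an assumption, and a real script must also handle text/ntext columns and run only after dependencies are dropped):

```sql
-- Generate ALTER statements for every char/varchar/nchar/nvarchar
-- column that is not already in the target collation.
DECLARE @target sysname = N'Latin1_General_CI_AS';

SELECT 'ALTER TABLE ' + QUOTENAME(s.name) + '.' + QUOTENAME(t.name)
     + ' ALTER COLUMN ' + QUOTENAME(c.name) + ' '
     + ty.name
     + CASE WHEN c.max_length = -1 THEN '(max)'
            WHEN ty.name LIKE 'n%' THEN '(' + CAST(c.max_length / 2 AS varchar(10)) + ')'
            ELSE '(' + CAST(c.max_length AS varchar(10)) + ')'
       END
     + ' COLLATE ' + @target
     + CASE WHEN c.is_nullable = 1 THEN ' NULL;' ELSE ' NOT NULL;' END
FROM sys.columns AS c
JOIN sys.tables  AS t  ON t.object_id = c.object_id
JOIN sys.schemas AS s  ON s.schema_id = t.schema_id
JOIN sys.types   AS ty ON ty.user_type_id = c.user_type_id
WHERE ty.name IN ('char', 'varchar', 'nchar', 'nvarchar')
  AND c.collation_name <> @target
  AND c.is_computed = 0;
```

Run the output it produces, not the query itself, and keep the results with your notes so the conversion can be replayed per database.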
As you can see, this is a royal pain to do.
For most of the databases in my care, however, once it’s done, it’s done. I archive the scripts I used, and move on to the next lot. I’m happy to say I’m about 75% through moving all the data for which I’m responsible to the newer collation.
Someday all of this will recede from memory like the pain of childbirth.
Is there a place for both techniques? Most definitely.
We have one production database that is regularly copied, via SSIS with a large number of pre- and post- copy steps unique to this application, to test copies on another server. I am hoping to upgrade the test server in the near future. The production server will remain as-is for quite some time.
I can’t possibly include my collation-changing steps in this SSIS package; they are too fragile to be automated successfully. Every time the vendor’s support makes changes to the database, I would be changing all those scripted steps to accommodate new functions, new statistics, etc. This process would change from something that is worry-free, and very critical to us, into something that breaks every couple of weeks.
I am absolutely thrilled that I can include a simple ALTER DATABASE [..] SET CONTAINMENT statement in the package and the copies will function successfully on the newer server. Since the package generates almost all the T-SQL statements at runtime anyway, this is a trivial addition.
So, hurray for SQL Server 2012, and hurray for contained databases.
Just don’t let all the hoopla about this great new feature do all your thinking for you.
3 thoughts on “Containment makes collation better in SQL 2012 – but don’t get confused”
Hi Lisa, what a coincidence!
Our client (you know which one, <s>) reported an issue where accented characters could not be inserted into the staging table by the bulk insert task in a SQL Server 2005 SSIS package. I removed the Latin1_General_CI_AS collation from the BCP format file to fix it.
Character set, encoding, code page, collation, pain in my ass…
Glad to hear you are still on that project!
How are you? Thanks for stopping by.