collating charactersets?



I seem to be talking a lot about SQL Server of late (just because of the work that has come up recently) - i promise to get back on topic soon....

I'll try and relate this back to Oracle anyway as it's sometimes useful to compare functionality.

In a previous post i demonstrated a simple technique to run some SQL against lots of SQL Server databases - now i tried to use that same technique again with some different SQL and got an unexpected error:


The SQL i was trying to run was this

select RUN.RN_TESTER_NAME , count(RUN.RN_TESTER_NAME) as run_count, RUN.RN_STATUS , count(RUN.RN_STATUS) as status_count  from DB1.td.run where (run.RN_EXECUTION_DATE >= '2014-01-01' and run.RN_EXECUTION_DATE <= '2014-12-31')  group by  RUN.RN_TESTER_NAME, RUN.RN_STATUS  union all
etc union all
etc

Now the only real difference between this statement and the last one was the aggregate function i wanted to use (count) and the group by.

This will obviously force the database to do more work and will involve use of grouping and sorting operations to be able to do that.

Now this is where it gets interesting...

The collation (characterset in Oracle world) can be defined in a number of places it would seem - either at the instance level (an instance being an install of sql server), a database level (think schema in oracle) or at the column level (as can be done in oracle in a limited way with NCHAR/NVARCHAR2 columns). In this particular case the collation settings were different

For the SQL Server instance it was

SELECT SERVERPROPERTY('collation') 

Which returned Latin1_General_CI_AS

For the databases it was

SELECT
 NAME, 
 COLLATION_NAME
FROM sys.Databases
 ORDER BY DATABASE_ID ASC

Which returned a mix of SQL_Latin1_General_CP1_CI_AS and Latin1_General_CI_AS

So that seems to be the root cause of the problem - the fact that some of them are different and don't match the instance level values. When MSSQL then tries to group/order the items it can't as there is a mix of collation/charactersets.

So what can we do about that?

Well helpfully there is a collation function which converts the character datatypes to the appropriate setting

So as an example this

RUN.RN_TESTER_NAME collate Latin1_General_CI_AS

Makes the RN_TESTER_NAME column return as that collation

So now all i need to do is add this extra snippet into my code that generates the actual report i want to run - like this

SELECT 'select ' + CHAR(39) + name + CHAR(39) +' ,RUN.RN_TESTER_NAME collate Latin1_General_CI_AS, count(RUN.RN_TESTER_NAME) as run_count, RUN.RN_STATUS collate Latin1_General_CI_AS, count(RUN.RN_STATUS) as status_count  from ' + name + '.td.run where (run.RN_EXECUTION_DATE >= '+CHAR(39)+'2014-01-01'+CHAR(39)+' and run.RN_EXECUTION_DATE <= '+CHAR(39)+'2014-12-31'+CHAR(39)+')  group by  RUN.RN_TESTER_NAME, RUN.RN_STATUS  union all'
FROM master..sysdatabases
where name not in ('master','model','msdb','qcsiteadmin_dbV11','tempdb','LiteSpeedLocal')

And the issue is fixed - i can now run the query i want to - which ends up like this:

select 'DB1' ,RUN.RN_TESTER_NAME collate Latin1_General_CI_AS, count(RUN.RN_TESTER_NAME) as run_count, RUN.RN_STATUS collate Latin1_General_CI_AS, count(RUN.RN_STATUS) as status_count  from DB1.run where (run.RN_EXECUTION_DATE >= '2014-01-01' and run.RN_EXECUTION_DATE <= '2014-12-31')  group by  RUN.RN_TESTER_NAME, RUN.RN_STATUS  union all
select 'DB2' ,RUN.RN_TESTER_NAME collate Latin1_General_CI_AS, count(RUN.RN_TESTER_NAME) as run_count, RUN.RN_STATUS collate Latin1_General_CI_AS, count(RUN.RN_STATUS) as status_count  from DB2.td.run where (run.RN_EXECUTION_DATE >= '2014-01-01' and run.RN_EXECUTION_DATE <= '2014-12-31')  group by  RUN.RN_TESTER_NAME, RUN.RN_STATUS  union all
etc 
etc

So it seems charactersets are as much of a problem in MSSQL as they are in Oracle.....



Comments