Importing flatfiles to a SQL server with a varying number of columns

Ever been as frustrated as I have when importing flatfiles to a SQL Server and the format suddenly changes in production?

The most commonly used integration tools (like SSIS) are very dependent on correct and consistent metadata when working with flatfiles.

I’ve come up with a solution that I would like to share with you.

When implemented, the process of importing flatfiles with changing metadata is handled in a structured and, most importantly, flawless way – even if the columns change order or existing columns are missing.

Background

When importing flatfiles to SQL Server, almost every standard integration tool (including T-SQL bulk load) requires fixed metadata from the files in order to work with them.

This is quite understandable, as the process of data transportation from the source to the destination needs to know where to map every column from the source to the defined destination.

Let me give an example:

A source flatfile table like the one below needs to be imported to a SQL Server database.

This file could be imported to a SQL Server database (in this example named FlatFileImport) with below script:

create table dbo.personlist (
	[name] varchar(20),
	[gender] varchar(10),
	[age] int,
	[city] varchar(20),
	[country] varchar(20)
);

BULK INSERT dbo.personlist
FROM 'c:\source\personlist.csv'
WITH
(
	FIRSTROW = 2,
	FIELDTERMINATOR = ';',  --CSV field delimiter
	ROWTERMINATOR = '\n',   --Use to shift the control to next row
	TABLOCK,
	CODEPAGE = 'ACP'
);

select * from dbo.personlist;

The result:

If the column ‘Country’ were removed from the file after the import has been set up, the process of importing the file would either break or produce wrong data (depending on the tool used to import the file). The metadata of the file has changed.

-- import data from file with missing column (Country)
truncate table dbo.personlist;
 
BULK INSERT dbo.personlist
FROM 'c:\source\personlistmissingcolumn.csv'
WITH
(
	FIRSTROW = 2,
	FIELDTERMINATOR = ';',  --CSV field delimiter
	ROWTERMINATOR = '\n',   --Use to shift the control to next row
	TABLOCK,
	CODEPAGE = 'ACP'
);
 
select * from dbo.personlist;

With this example, the import seems to go well, but upon browsing the data, you’ll see that only one row is imported and the data is wrong.

The same would happen if the columns ‘Gender’ and ‘Age’ were to switch places. Maybe the import would not break, but the mapping of the columns to the destination would be wrong, as the ‘Age’ column would go to the ‘Gender’ column in the destination and vice versa. This is due to the order and datatype of the columns. If the columns had the same datatype and the data could fit in the columns, the import would go fine – but the data would still be wrong.

-- import data from file with switched columns (Age and Gender)
truncate table dbo.personlist;
 
BULK INSERT dbo.personlist
FROM 'c:\source\personlistswitchedcolumns.csv'
WITH
(
	FIRSTROW = 2,
	FIELDTERMINATOR = ';',  --CSV field delimiter
	ROWTERMINATOR = '\n',   --Use to shift the control to next row
	TABLOCK,
	CODEPAGE = 'ACP'
);

When importing the same file, but this time with an extra column (Married) – the result would also be wrong:

-- import data from file with new extra column (Married)
truncate table dbo.personlist;
 
BULK INSERT dbo.personlist
FROM 'c:\source\personlistextracolumn.csv'
WITH
(
	FIRSTROW = 2,
	FIELDTERMINATOR = ';',  --CSV field delimiter
	ROWTERMINATOR = '\n',   --Use to shift the control to next row
	TABLOCK,
	CODEPAGE = 'ACP'
);
 
select * from dbo.personlist; 

The result:

The above examples are made with pure T-SQL code. If the same were done with an integration tool like SQL Server Integration Services, the errors would be different, and the SSIS package would throw errors and not be able to execute the data transfer.

The cure

When using the above BULK INSERT functionality from T-SQL, the import process often goes well, but the data is wrong when the source file is changed.

There is another way to import flatfiles. This is using the OPENROWSET functionality from TSQL.

In section E of the example scripts from MSDN, it is described how to use a format file. A format file is a simple XML file that contains information about the source file's structure – including columns, datatypes, row terminator and collation.

Generating the initial format file for a certain source is rather easy when setting up the import.

But what if the generation of the format file could be done automatically and the import process would be more streamlined and manageable – even if the structure of the source file changes?

From my GitHub project you can download a home brewed .NET console application that solves just that.

If you are unsure of the .EXE file's content and origin, you can download the code and build your own version of the GenerateFormatFile.exe application.
Another note: I'm not a hardcore .NET developer, so someone might have another way of doing this. You are very welcome to contribute to the GitHub project in that case.

The application takes the following inputs: the path to the source file (-p), the source file name (-f), the name of the output format file (-o) and the field delimiter (-d).

Example usage:

generateformatfile.exe -p c:\source\ -f personlist.csv -o personlistformatfile.xml -d ;

The above command generates a format file in the directory c:\source\ and names it personlistformatfile.xml.

The content of the format file is as follows:
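The original post shows the generated file as an image. As a sketch, a standard BCP XML format file for a layout like personlist.csv could look roughly like this (the field lengths and exact attributes below are my own guesses and depend on the generator settings):

<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <RECORD>
    <FIELD ID="1" xsi:type="CharTerm" TERMINATOR=";" MAX_LENGTH="20"/>
    <FIELD ID="2" xsi:type="CharTerm" TERMINATOR=";" MAX_LENGTH="10"/>
    <FIELD ID="3" xsi:type="CharTerm" TERMINATOR=";" MAX_LENGTH="12"/>
    <FIELD ID="4" xsi:type="CharTerm" TERMINATOR=";" MAX_LENGTH="20"/>
    <FIELD ID="5" xsi:type="CharTerm" TERMINATOR="\r\n" MAX_LENGTH="20"/>
  </RECORD>
  <ROW>
    <COLUMN SOURCE="1" NAME="name" xsi:type="SQLVARYCHAR"/>
    <COLUMN SOURCE="2" NAME="gender" xsi:type="SQLVARYCHAR"/>
    <COLUMN SOURCE="3" NAME="age" xsi:type="SQLVARYCHAR"/>
    <COLUMN SOURCE="4" NAME="city" xsi:type="SQLVARYCHAR"/>
    <COLUMN SOURCE="5" NAME="country" xsi:type="SQLVARYCHAR"/>
  </ROW>
</BCPFORMAT>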

The console application can also be called from TSQL like this:

-- generate format file
declare @cmdshell varchar(8000);
set @cmdshell = 'c:\source\generateformatfile.exe -p c:\source\ -f personlist.csv -o personlistformatfile.xml -d ;'
exec xp_cmdshell @cmdshell;

If by any chance the xp_cmdshell feature is not enabled on your local machine – then please refer to this post from Microsoft: Enable xp_cmdshell
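For reference, enabling xp_cmdshell comes down to two sp_configure calls (run with sysadmin rights):

-- enable xp_cmdshell (requires sysadmin)
exec sp_configure 'show advanced options', 1;
reconfigure;
exec sp_configure 'xp_cmdshell', 1;
reconfigure;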

Using the format file

After generation of the format file, it can be used in TSQL script with OPENROWSET.

Example script for importing the ‘personlist.csv’

-- import file using format file
select *  
into dbo.personlist_bulk
from  openrowset(
	bulk 'c:\source\personlist.csv',  
	formatfile='c:\source\personlistformatfile.xml',
	firstrow=2
	) as t;
 
select * from dbo.personlist_bulk;

This loads the data from the source file to a new table called ‘personlist_bulk’.

From here the load from ‘personlist_bulk’ to ‘personlist’ is straight forward:

-- load data from personlist_bulk to personlist
truncate table dbo.personlist;
 
insert into dbo.personlist (name, gender, age, city, country)
select * from dbo.personlist_bulk;
 
select * from dbo.personlist;
 
drop table dbo.personlist_bulk;

Load data even if source changes

The above approach works if the source is the same every time it is loaded. But with a dynamic approach to the load from the bulk table to the destination table, it can be ensured that the process works even if the source changes in both width (number of columns) and column order.

To some, the script might seem cryptic – but it is only a matter of generating a list of the column names from the source table that correspond with the column names in the destination table.

-- import file with different structure
-- generate format file
if OBJECT_ID('dbo.personlist_bulk') is not null drop table dbo.personlist_bulk;
 
declare @cmdshell varchar(8000);
set @cmdshell = 'c:\source\generateformatfile.exe -p c:\source\ -f personlistmissingcolumn.csv -o personlistmissingcolumnformatfile.xml -d ;'
exec xp_cmdshell @cmdshell;
 
 
-- import file using format file
select *  
into dbo.personlist_bulk
from  openrowset(
	bulk 'c:\source\personlistmissingcolumn.csv',  
	formatfile='c:\source\personlistmissingcolumnformatfile.xml',
	firstrow=2
	) as t;
 
-- dynamic load data from bulk to destination
declare @fieldlist varchar(8000);
declare @sql nvarchar(4000);
 
select @fieldlist = 
				stuff((select 
					',' + QUOTENAME(r.column_name)
						from (
							select column_name from INFORMATION_SCHEMA.COLUMNS where TABLE_NAME = 'personlist'
							) r
							join (
								select column_name from INFORMATION_SCHEMA.COLUMNS where TABLE_NAME = 'personlist_bulk'
								) b
								on b.COLUMN_NAME = r.COLUMN_NAME
						for xml path('')),1,1,'');
 
print (@fieldlist);
set @sql = 'truncate table dbo.personlist;' + CHAR(10);
set @sql = @sql + 'insert into dbo.personlist (' + @fieldlist + ')' + CHAR(10);
set @sql = @sql + 'select ' + @fieldlist + ' from dbo.personlist_bulk;';
print (@sql)
exec sp_executesql @sql
 

The result is a T-SQL statement that looks like this:

truncate table dbo.personlist;
insert into dbo.personlist ([age],[city],[gender],[name])
select [age],[city],[gender],[name] from dbo.personlist_bulk;

The exact same approach can be used with the other source files in this demo. The result is that the destination table is correct and loaded with the right data every time – and only with the data that corresponds with the source. No errors will be thrown.

From here there are some remarks to be taken into account:

  1. As no errors are thrown, the source file could be empty and the data in the destination table could end up blank. This has to be handled by processes outside this demo.

Further work

As this demo and post show, it is possible to handle dynamically changing flat source files. Changing columns, column order and other changes can be handled in an easy way with a few lines of code.

Going from here, a suggestion could be to set up a process that compares the two tables (bulk and destination) and throws an error if a certain number of columns are missing from the bulk table or a certain number of columns are new.

It is also possible to auto generate missing columns in the destination table based on columns from the bulk table.
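A minimal sketch of how that could look – here simply adding any missing column as a nullable varchar(255), which is my own assumption; adjust the datatype handling to your own needs:

-- add columns found in personlist_bulk but missing in personlist
declare @alter nvarchar(max) = N'';

select @alter = @alter
	+ 'alter table dbo.personlist add ' + QUOTENAME(b.COLUMN_NAME) + ' varchar(255) null;' + CHAR(10)
	from INFORMATION_SCHEMA.COLUMNS b
	where b.TABLE_NAME = 'personlist_bulk'
	and not exists (
		select 1 from INFORMATION_SCHEMA.COLUMNS r
		where r.TABLE_NAME = 'personlist' and r.COLUMN_NAME = b.COLUMN_NAME
		);

print (@alter);
exec sp_executesql @alter;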

Only your imagination sets the boundaries here.

Summary – importing flatfiles to a SQL server

With this blogpost I hope to have given you inspiration to build your own import structure for flatfiles in those cases where the structure might change.

As seen above, the approach needs some .NET skills – but once the console application has been built, it is a matter of reusing the same application across the different integration solutions in your environment.

Happy coding 🙂

External links:

BULK INSERT from MSDN: https://msdn.microsoft.com/en-us/library/ms188365.aspx

OPENROWSET from MSDN: https://msdn.microsoft.com/en-us/library/ms190312(v=sql.130).aspx

XP_CMDSHELL from MSDN: https://msdn.microsoft.com/en-us/library/ms175046.aspx

GitHub link: https://github.com/brianbonk/GenerateFormatFile/releases/tag/v2.0

Undelete object from database


Have you ever deleted an object from a database by mistake or due to some other error? You can undelete the object – sometimes.

Then you should read on in this short post.

I recently came across a good co-worker of mine who had lost one of the views on the development database. He called me for help.

Fortunately the database was in the FULL recovery model – so I could extract the object from the database log and send the script to him for his further work that day. I think I saved him a whole day of work…

The undelete object script

Here is the script I used:

select 
	convert(varchar(max), substring([RowLog Contents 0], 33, LEN([RowLog Contents 0]))) as [Script]
from 
	fn_dblog(NULL, NULL)
where 1=1
	and [Operation] = 'LOP_DELETE_ROWS' 
	and [Context] = 'LCX_MARK_AS_GHOST'
	and [AllocUnitName] = 'sys.sysobjvalues.clst';
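To test it, a hypothetical demo (not from the original incident) could be: create a view, drop it and run the script – the [Script] column should then contain the CREATE VIEW text, possibly followed by some trailing noise.

-- hypothetical demo of the undelete script
create view dbo.UndeleteDemo as select 1 as col;
go
drop view dbo.UndeleteDemo;
go
-- now run the fn_dblog script above and look for the view definition in [Script]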

Ready, SET, go – how does SQL Server handle recursive CTEs

This blogpost will cover some of the basics of recursive CTEs and explain the approach taken by the SQL Server engine.

First of all, a quick recap on what a recursive query is.

Recursive queries are useful for building hierarchies, traversing datasets, generating arbitrary rowsets, etc. The recursive part (simply put) means joining a rowset with itself an arbitrary number of times.

A recursive query is defined by an anchor set (the base rowset of the recursion) and a recursive part (the operation that should be done over the previous rowset).

The basics in recursive CTE

A recursive query helps in a lot of scenarios. For instance, where a dataset is built as a parent-child relationship and the requirement is to “unfold” this dataset and show the hierarchy in a ragged format.

A recursive CTE has a defined syntax – and can be written in general terms like this (don't run away because of the general syntax – a lot of examples in real code will follow):

select result_from_previous.*
 from result_from_previous
 union all
 select result_from_current.*
 from set_operation(result_from_previous, mytable) as result_from_current

Or rewritten in another way:

select result_from_previous.*
 from result_from_previous
 union all
 select result_from_current.*
 from result_from_previous
 join mytable
 on condition(result_from_previous)

Another way to write the query (using cross apply):

select result_from_current.*
from result_from_previous
cross apply (
select result_from_previous.*
union all
select *
from mytable
where condition(result_from_previous.*)
) as result_from_current

The last one – with the cross apply – is row-based and a lot slower than the other two. It iterates over every row from the previous result and computes the scalar condition (which returns true or false). Each row in mytable is then compared against the current row of result_from_previous. When these conditions hold, the query can be rewritten as a join – which is why you should not use cross apply for recursive queries.

The reverse – from join to cross apply – is not always true. To know this, we need to look at the algebra of distributivity.

Distributivity algebra

Most of us have already learned that below mathematics is true:

X x (Y + Z) = (X x Y) + (X x Z)

But below is not always true:

X ^ (Y x Z) = (X ^ Z) x (X ^ Y)

Or said with words, distributivity means that the order of operations is not important. The multiplication can be done after the addition and the addition can be done after the multiplication. The result will be the same no matter what.

This arithmetic can be used to generate the relational algebra – it’s pretty straight forward:

set_operation(A union all B, C) = set_operation(A, C) union all set_operation(B, C)

The equality above holds, just like the first equality in the arithmetic example.

So the union all over the operations is the same as the operations over the union all. This also implies that you cannot use operators like top, distinct, outer join (more exceptions here). The distribution is not the same between top over union all and union all over top. Microsoft has done a lot of good thinking in the recursive approach to reach one ultimate goal – forbid operators that do not distribute over union all.

With this information and knowledge our baseline for building a recursive CTE is now in place.

The first recursive query

Based on the intro and the above algebra we can now begin to build our first recursive CTE.

Consider a sample rowset (sampletree):

id  parentId  name
1   NULL      Ditlev
2   NULL      Claus
3   1         Jane
4   2         John
5   3         Brian

From above we can see that Brian refers to Jane who refers to Ditlev. And John refers to Claus. This is fairly easy to read from this rowset – but what if the hierarchy is more complex and unreadable?

A sample requirement could be to “unfold” the hierarchy in a ragged hierarchy so it is directly readable.

The anchor

We start with the anchor set (Ditlev and Claus). In this dataset the anchor is defined by parentId being null.

This gives us an anchor-query like below:

recursive CTE 1
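The anchor query itself is shown as an image in the original post; a minimal sketch of it against the sampletree rowset could look like this (the extra blank parentName field is explained in the next part):

select
	t.id,
	t.parentId,
	t.name,
	cast(NULL as varchar(20)) as parentName	-- blank field, filled in by the recursive part
from dbo.sampletree as t
where t.parentId is null;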

Now on to the next part.

The recursive

After the anchor part, we are ready to build the recursive part of the query.

The recursive part is actually the same query with small differences. The main select is the same as in the anchor part, but we need to add a self-reference (a join to the CTE itself) in the recursive part.

Before we dive more into the total statement – I’ll show the statement below. Then I’ll run through the details.

recursive CTE 2
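The full statement is shown as an image above; a minimal sketch of it, assuming the sampletree table and columns from the example, could look like this:

with tree as (
	-- anchor part
	select
		t.id,
		t.parentId,
		t.name,
		cast(NULL as varchar(20)) as parentName
	from dbo.sampletree as t
	where t.parentId is null

	union all

	-- recursive part: joins the source table to the CTE itself (the self-reference)
	select
		t.id,
		t.parentId,
		t.name,
		cast(p.name as varchar(20)) as parentName
	from dbo.sampletree as t
	join tree as p
		on t.parentId = p.id
)
select *
from tree;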

Back to the self-reference. Notice the two red underlines in the code. The top one indicates the CTE's name and the second one indicates the self-reference. This is joined directly in the recursive part in order to do the arithmetic logic in the statement. The join is done between the recursive result's parentId and the id in the anchor result. This gives us the possibility of getting the name column from the anchor statement.

Notice that I’ve also put in another blank field in the anchor statement and added the parentName field in the recursive statement. This gives us the “human readable” output where I can find the hierarchy directly by reading from left to right.

To get data from the above CTE I just have to make a select statement from this:

recursive CTE 3

And the results:

recursive CTE 4

I can now directly read that Jane refers to Ditlev and Brian refers to Jane.

But how is this done when the SQL engine executes the query – the next part tries to explain that.

The SQL engine's handling

Given the full CTE statement above I’ll try to explain what the SQL engine does to handle this.

The documented semantics is as follows:

  1. Split the CTE into anchor and recursive parts
  2. Run the anchor member creating the first base result set (T0)
  3. Run the recursive member with Ti as an input and Ti+1 as an output
  4. Repeat step 3 until an empty result set is returned
  5. Return the result set. This is a union all set of T0 to Tn

So let me try to rewrite the above query to match this sequence.

The anchor statement we already know:

recursive CTE 5

First recursive query:

recursive CTE 6

Second recursive query:

recursive CTE 7

The n recursive query:

The union all statement:
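The rewritten statements are shown as images in the original post; a minimal sketch of the manual rewrite against the sampletree example could look like this:

-- T0: the anchor (Ditlev and Claus)
select t.id, t.parentId, t.name, cast(NULL as varchar(20)) as parentName
from dbo.sampletree as t
where t.parentId is null

union all

-- T1: first recursive step (Jane and John)
select t.id, t.parentId, t.name, t0.name
from dbo.sampletree as t
join dbo.sampletree as t0 on t.parentId = t0.id
where t0.parentId is null

union all

-- T2: second recursive step (Brian); the next step (T3) returns an empty set and stops the recursion
select t.id, t.parentId, t.name, t1.name
from dbo.sampletree as t
join dbo.sampletree as t1 on t.parentId = t1.id
join dbo.sampletree as t0 on t1.parentId = t0.id
where t0.parentId is null;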

This gives us exactly the same result as we saw before the rewrite:

Notice that the statement I've put in above, named Tn, is actually empty. This is to illustrate the empty result set that makes the SQL engine stop its execution of the recursive CTE.

This is how I would describe the SQL engine's handling of a recursive CTE.

Based on this very simple example, I guess you already can think of ways to use this in your projects and daily tasks.

But what about the performance and execution plan?

Performance

The execution plan for the original recursive CTE looks like this:

The top part of this execution plan is the anchor statement and the bottom part is the recursive statement.

Notice that I haven’t made any indexes in the table, so we are reading on heaps here.

But what if the data is more complex in structure and depth? Let's try to base the answer on an example:

From the attached SQL code you'll find a script to generate 20,000+ rows in a new table called complextree. This data is from a live solution and contains medical procedure names in a hierarchy. The data is used to show the relationships in medical procedures done by the Danish hospital system. It is both deep and complex in structure. (Sorry for the Danish letters in the data…)

When we run a recursive CTE on this data, we get exactly the same execution plan:

This is also what I would expect, as the amount of data, when reading from heaps, very seldom impacts the generated execution plan.

The query runs on my PC for 25 seconds.

Now let me put an index on the table and let's see the performance and execution plan.

The index is only put on the parentDwId column which, as we know from this article, is the recursive part's join column.
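The index statement itself is not shown in the post body; based on the description it could be as simple as this (the index name is my own):

-- nonclustered index on the recursive part's join column
create nonclustered index IX_complextree_parentDwId
	on dbo.complextree (parentDwId);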

The query now runs to completion in 1 second and generates this execution plan:

The top line is still the anchor and the bottom part is the recursive part. Notice that the SQL engine now uses the non-clustered index to perform the execution, and the performance gain is noticeable.

Conclusion

I hope that you’ve now become more familiar with the recursive CTE statement and are willing to try it on your own projects and tasks.

The basics are somewhat straightforward – but beware that the query can become complex and hard to debug as the demands for data and output grow. But don't be scared. As I always say – "Don't do a complex query all at once, start small and build it up as you go along".

Happy coding.

External links:

The with operator in T-SQL: https://technet.microsoft.com/en-us/library/ms175972.aspx

Recursive CTE’s from MSDN: https://msdn.microsoft.com/en-us/library/ms186243.aspx

Wikipedia on distributivity: https://en.wikipedia.org/wiki/Distributive_property

Use of hierarchyid in SQL Server

I attended a TDWI conference in May 2016 in Chicago. Here I got a hint about the datatype hierarchyid in SQL Server which could optimize and eliminate the good old parent/child hierarchy.

Until then, I (and several others in the class) hadn't heard about the hierarchyid datatype in SQL Server. So I had to find out and learn about it.

Here’s a blogpost covering some of the aspects of the datatype hierarchyid – including:

  • Introduction
  • How to use it
  • How to optimize data in the table
  • How to work with data in the hierarchy-structure
  • Goodies

Introduction

The hierarchyid datatype was introduced in SQL Server 2008. It is a variable-length system datatype. The datatype can be used to represent a given element's position in a hierarchy – e.g. an employee's position within an organization.

The datatype is extremely compact. The storage depends on the average fanout (fanout = the number of children in all nodes). For smaller fanouts (0-7) the typical storage is about 6 * logA(n) bits, where A is the average fanout and n is the total number of nodes in the tree. Given the above formula, an organization with 100,000 employees and a fanout of 6 will take around 38 bits – rounded up to 5 bytes of total storage for the hierarchy structure.

Even though the limit of the datatype is 892 bytes, there is a lot of room for extremely complex and deep structures.

When representing the values to and from the hierarchyid datatype, the string syntax is:

/[level id 1]/[level id 2]/../[level id n]/

Example:

/1/7/3/

The values between the '/' characters can be of decimal types, e.g. 0.1, 2.3 etc.

Given two specific nodes in the hierarchy, a and b, a < b means that b comes after a in a depth-first traversal of the tree structure. Any search and comparison on the tree is done this way by the SQL engine.

The datatype directly supports deletions and inserts through the GetDescendant method (see later for the full list of methods using this feature). This method enables generation of siblings to the right of any given node and to the left of any given node – even between two siblings. NOTE: inserting a new node between two siblings produces values that are slightly less compact.

Hierarchyid in SQL Server – how to use it

Given an example of data – see the complete SQL script at the end of this post to generate the example used here.

hierarchyid in SQL Server 1

The Num field is a simple ascending counter for each level member in the hierarchy.

There are some basic methods to be used in order to build the hierarchy using the hierarchyid datatype.

GetRoot method

The GetRoot method gives the hierarchyid of the root node in the hierarchy – represented by EmployeeId 1 in the above example.

The code and result could look like this:

hierarchyid in SQL Server 2

The value ‘0x’ from the OrgPath field is the representation of the string ‘/’ giving the root of the hierarchy. This can be seen using a simple cast to varchar statement:

hierarchyid in SQL Server 3
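The original code is shown as images above; as a minimal sketch, GetRoot is called as a static method on the hierarchyid type, and a cast to varchar reveals the string representation:

select
	hierarchyid::GetRoot() as RootNode,                              -- 0x
	cast(hierarchyid::GetRoot() as varchar(10)) as RootNodeAsString; -- '/'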

Building the new structure with the hierarchyid datatype using a recursive SQL statement:

hierarchyid in SQL Server 4
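The recursive statement is shown as an image; a minimal sketch of the idea, assuming a source table dbo.Employees with EmployeeId, ManagerId, Name and Num columns (the table and target names are my own), could look like this:

with orgpaths as (
	-- anchor: the root employee
	select e.EmployeeId, e.ManagerId, e.Name, e.Num,
		cast('/' as varchar(500)) as OrgPathString
	from dbo.Employees as e
	where e.ManagerId is null

	union all

	-- recursive: append the child's Num to the parent's path
	select e.EmployeeId, e.ManagerId, e.Name, e.Num,
		cast(p.OrgPathString + cast(e.Num as varchar(10)) + '/' as varchar(500))
	from dbo.Employees as e
	join orgpaths as p
		on e.ManagerId = p.EmployeeId
)
select EmployeeId, ManagerId, Name,
	cast(OrgPathString as hierarchyid) as OrgPath
into dbo.OrgChart
from orgpaths;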

Notice the building of the path after the union all. This complies with the above-mentioned syntax for building the hierarchy structure to be converted to a hierarchyid datatype.

If I were to build the path for EmployeeId 10 (Name = 'Mads') in the above example, it would look like this: '/2/2/'. A select statement converting the hierarchyid field OrgPath for the same record reveals the same thing:

hierarchyid in SQL Server 5

Notice the use of the ToString method here – another built-in method for the hierarchyid in SQL Server.

GetLevel method

The GetLevel method returns the current node's level, with an index of 0 from the top:

hierarchyid in SQL Server 6
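A minimal sketch, assuming the OrgChart table sketched above:

select EmployeeId, Name,
	OrgPath.ToString() as OrgPathString,
	OrgPath.GetLevel() as OrgLevel		-- 0 for the root, 1 for its direct reports, and so on
from dbo.OrgChart;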

GetDescendant method

This method returns a new hierarchyid based on the two parameters child1 and child2.

The use of these parameters is described in the BOL HERE.

Below are some short examples of the usage.

Getting a new hierarchyid when a new employee reporting to the top manager is hired:

hierarchyid in SQL Server 7

Getting a new hierarchyid when a new hire is reporting to Jane in the hierarchy:

hierarchyid in SQL Server 8
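The original statements are shown as images; a minimal sketch of the GetDescendant call, assuming the root already has children /1/, /2/ and /3/ (for the second scenario, use Jane's OrgPath as the parent instead):

declare @parent hierarchyid, @lastChild hierarchyid;
set @parent = hierarchyid::GetRoot();
set @lastChild = hierarchyid::Parse('/3/');

-- first child when the parent has no children yet
select cast(@parent.GetDescendant(NULL, NULL) as varchar(20)) as FirstChild;	-- '/1/'

-- next sibling after the last existing child
select cast(@parent.GetDescendant(@lastChild, NULL) as varchar(20)) as NextSibling;	-- typically '/4/'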

Dynamically inserting new records in the hierarchy table – this can easily be converted into a stored procedure:

hierarchyid in SQL Server 9
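The original statement is shown as an image; a minimal sketch of the same idea, assuming the OrgChart table from above and hypothetical values for the new employee, could look like this:

declare @ManagerId int = 3, @NewEmployeeId int = 11, @NewName varchar(50) = 'NewHire';
declare @manager hierarchyid, @lastChild hierarchyid;

select @manager = OrgPath from dbo.OrgChart where EmployeeId = @ManagerId;

-- last existing child under the manager (GetAncestor(1) = the node one step up)
select top (1) @lastChild = OrgPath
from dbo.OrgChart
where OrgPath.GetAncestor(1) = @manager
order by OrgPath desc;

insert into dbo.OrgChart (EmployeeId, ManagerId, Name, OrgPath)
values (@NewEmployeeId, @ManagerId, @NewName, @manager.GetDescendant(@lastChild, NULL));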

Notice the GetAncestor method, which takes one parameter (the number of steps up the hierarchy) and returns that level's hierarchyid – in this case just 1 step up the hierarchy.

More methods

There are several more methods to use when working on a hierarchy table – as found on BOL:

GetDescendant – returns a new child node of a given parent. Takes two parameters.

GetLevel – returns the given level for a node (0 index)

GetRoot – returns a root member

ToString – converts a hierarchyid datatype to readable string

IsDescendantOf – returns boolean telling if a given node is a descendant of given parent

Parse – converts a string to a hierarchyid

Read – is used implicitly by the ToString method. Cannot be called from a T-SQL statement

GetReparentedValue – returns the node under a new root, used when moving a given node

Write – returns a binary representation of the hierarchyid. Cannot be called by the T-SQL statement.

Optimization

As in many other scenarios in SQL Server, the usual approach to indexing and optimization can be used.

To help the usual and most-used queries, I would create the two indexes below on the example table:

hierarchyid in SQL Server 10
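The index definitions are shown as an image; a common pattern (a sketch, assuming the OrgChart table) is a depth-first index on the hierarchyid column itself and a breadth-first index on a persisted level column plus the hierarchyid:

-- depth-first: rows in the same subtree are stored next to each other
create clustered index IX_OrgChart_Depth
	on dbo.OrgChart (OrgPath);

-- breadth-first: rows on the same level are stored next to each other
alter table dbo.OrgChart
	add OrgLevel as OrgPath.GetLevel() persisted;

create nonclustered index IX_OrgChart_Breadth
	on dbo.OrgChart (OrgLevel, OrgPath);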

But as with any other indexing strategy – base it on the given scenario and usage.

Goodies

So why use this feature and all the coding work that comes with it?

Well – from my perspective – it has just become very easy to quickly get all elements either up or down from a given node in the hierarchy.

Get all descendants from a specific node

If I want to get all elements below Jane in the hierarchy, I just have to run this command:

hierarchyid in SQL Server 11
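A minimal sketch of that command, assuming the OrgChart table and that Jane's row can be looked up by name:

declare @jane hierarchyid;
select @jane = OrgPath from dbo.OrgChart where Name = 'Jane';

select EmployeeId, Name, OrgPath.ToString() as OrgPath
from dbo.OrgChart
where OrgPath.IsDescendantOf(@jane) = 1;	-- note: includes Jane herself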

Think of the work you would have to do if this were a non-hierarchyid table using only parent/child relations and recursive SQL, especially if the structure were very complex and deep.

I know what I would choose.

Conclusion

As seen above, the hierarchyid datatype can be used to give order to the structure of a hierarchy in a way that is both efficient and fairly easily maintained.

If one wanted to optimize the structure even further, the EmployeeId and the ManagerId could be dropped, as the OrgPath is now just as distinct as the EmployeeId and can replace it. The ManagerId is only used to build the structure – but this is now also given by the OrgPath.

Happy coding…

External references:

Hierarchyid from MSDN

Using hierarchyid from TechNet

Enlarge AdventureWorks2012


Just recently I needed a big datawarehouse solution to test some performance optimization using BIML.
I could use the AdventureWorks2012 database, but I needed the clean datawarehouse tables in order to have minimal data maintenance when testing the BIML scripts.

I could not find one, and figured out it was faster to make my own.

So heavily inspired by this post from Jonathan Kehayias (blog), I’ve made a script that can be used to enlarge the dbo.FactInternetSales table.

The script creates a new table called dbo.FactInternetSalesEnlarged and copies data from dbo.FactInternetSales into it with a randomizer, exploding the data to a table 100 times bigger – est. 6 million rows.

Get the script here:

EnlargeAdventureWorksDW2012

Happy coding 🙂

SSIS expressions I often use


Whether you are doing your SSIS expressions by hand or using the BIML framework, you've come across expressions and the expression builder.

This is a helper list of my most commonly used SSIS expressions – the ones I always forget when I need them.

Strings

Filename from fully qualified pathstring

  • RIGHT([FilePath],FINDSTRING(REVERSE([FilePath]),"\\",1) - 1)

Folderpath from fully qualified pathstring

  • SUBSTRING([FilePath], 1, LEN([FilePath]) - FINDSTRING(REVERSE([FilePath]), "\\", 1) + 1)

Foldername from fully qualified pathstring

  • TOKEN([FilePath],"\\",TOKENCOUNT([FilePath],"\\") - 1)

This is only for SSIS2012 and onwards. The TOKEN and TOKENCOUNT expressions are not in prior versions of SQL Server

For prior versions of SQL Server:

  • SUBSTRING([FilePath],LEN([FilePath]) - FINDSTRING(REVERSE([FilePath]),"\\",2) + 2,(LEN([FilePath]) - FINDSTRING(REVERSE([FilePath]),"\\",1)) - (LEN([FilePath]) - FINDSTRING(REVERSE([FilePath]),"\\",2)) - 1)

Replace empty strings

With SQL Server 2012 the new REPLACENULL function was implemented, making it a lot easier to replace empty values.

  • REPLACENULL([ColumnName], [replace value])

For earlier versions of SQL Server

  • LEN([ColumnName]) == 0 ? [replace value] : [ColumnName]
  • ISNULL([ColumnName]) ? [replace value] : [ColumnName]

Date and time

Date from datetime

If you want to remove the time element in a datetime object, you should cast it to DT_DBDATE. But because that datatype is very inconvenient to use, you should cast it back to the original datatype. That will set the time to 0:00.

  • (DT_DATE)(DT_DBDATE)@[User::datetimeVariable]
  • (DT_DATE)(DT_DBDATE)[datetimeColumn]
  • (DT_DBTIMESTAMP)(DT_DBDATE)GETDATE()

Time from datetime

If you want to remove the date element in a datetime object, you should cast it to DT_DBTIME. And optional cast it to a string.

  • (DT_STR,8,1252)(DT_DBTIME)@[User::datetimeVariable]
  • (DT_STR,8,1252)(DT_DBTIME)[datetimeColumn]
  • (DT_STR,8,1252)(DT_DBTIME)GETDATE()

First day of the current month

If you want to get the first day of the current month, you take the current datetime and deduct the current day number (minus 1). Optional you can remove the time part:

  • DATEADD("d", -DAY(GETDATE()) + 1, GETDATE())
  • (DT_DBTIMESTAMP)(DT_DBDATE)DATEADD("d", -DAY(GETDATE()) + 1, GETDATE())

Last day of the current month

If you want to get the last day of the current month, you add 1 month and deduct the current day number. Optional you can remove the time part:

  • DATEADD("d", -DAY(GETDATE()), DATEADD("m", 1, GETDATE()))
  • (DT_DBTIMESTAMP)(DT_DBDATE)DATEADD("d", -DAY(GETDATE()), DATEADD("m", 1, GETDATE()))

And if you really want the last second of the current month, e.g. 30-06-2011 23:59:59:

  • DATEADD("s", -1, DATEADD("d", -DAY(GETDATE()) + 1, DATEADD("m", 1, (DT_DBTIMESTAMP)(DT_DBDATE)GETDATE())))

Weeknumber of the month

1 June 2012 is week number 23 of the year, but week number 1 of the month of June 2012.

  • (DATEPART("ww",[YourDate]) - DATEPART("ww",DATEADD("d", -DAY([YourDate]) + 1, [YourDate]))) + 1

Datetime as concatenated string

  • (DT_STR, 4, 1252)DATEPART("yyyy", @[System::StartTime]) +
    RIGHT("0" + (DT_STR, 2, 1252)DATEPART("mm", @[System::StartTime]), 2) +
    RIGHT("0" + (DT_STR, 2, 1252)DATEPART("dd", @[System::StartTime]), 2) +
    RIGHT("0" + (DT_STR, 2, 1252)DATEPART("hh", @[System::StartTime]), 2) +
    RIGHT("0" + (DT_STR, 2, 1252)DATEPART("mi", @[System::StartTime]), 2) +
    RIGHT("0" + (DT_STR, 2, 1252)DATEPART("ss", @[System::StartTime]), 2)

Dynamic partitioning tabular cube


In every Business Intelligence project there comes a time when the code needs to be deployed to the production environment.
No more development, no more manual work. It is time for dynamic partitioning.

But what about the partitions on the tabular cube? Do we really need to teach the DBA how to handle that on a periodic basis?

The answer is simple: No!

Thanks to the XMLA language, the DMVs for SSAS instances (both tabular and multidimensional) and SSIS, we can do the partitioning dynamically based on the current data in the datawarehouse.

In the example below I've made partitions for every month, but as always you need to take the current architecture, dataflow, data deliveries etc. into account when creating your solution.

Here we go:

We need a table in order to keep track of the partitions and their metadata, and also a table to hold data about the existing partitions in the Tabular cube:
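The table definitions are shown as images in the original post, so the layout below is purely my own guess at a minimal version of the two tables:

-- holds the partitions currently found in the Tabular model (loaded via the SSAS DMV)
create table dbo.ExistingPartitionsList (
	PartitionName varchar(200)
);

-- keeps track of the partitions this process has created and processed
create table dbo.PartitionLog (
	PartitionName varchar(200),
	FromDate date,
	ToDate date,
	CreatedDate datetime default getdate(),
	LastProcessed datetime
);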

In SSIS I've created a dataflow from the SSAS Tabular instance with the list of partitions on the table I want to partition, loading into the table ExistingPartitionsList.
We can do this thanks to the DMVs for SSAS, which also work for Tabular instances – see this link for further information.

TableID

The SQL statement to get the list of partitions (reference to MSDN):

The tableID is found in the table's properties (right-click and choose 'Properties'):

UpdateExistingPartitions

The controlflow looks like this. I hope it is somewhat self-explanatory:

ExistingPartitions

Now we need to take a look at the naming convention for the partitions. The partition name needs a suffix that tells the span of the partition. I've chosen [Tablename]-[fromdate]-[todate]. And in order to get usable data from the previous dataflow, I've made a view as below:

Giving this output:

My fact table (Måleraflæsninger) has a field that is used to filter the partition. In this case it is 'datekey'. In order to get all possible partitions that are needed for the project, I've made this view:

Notice that I've made the dates end at the last day of the month. This is going to be used to generate the view for the partitions later on.
Based on these two views we can now generate the list of partitions that need to be created in the Tabular project:

Now I need to generate a set of XMLA scripts: one to create a partition and one to process a partition. The easiest way to get these is to script them from the GUI in the SSAS Tabular instance.

1: Right-click the table that I’m working with and selecting ‘Partitions…’

CreatePartitionXMLA_1

2: Click ‘New partition’

CreatePartitionXMLA_2

3: Here specify the name – remember the convention – and the SQL statement. It is important to remember the filter criteria for the partitions. In this case it is one partition for every month.

CreatePartitionXMLA_3

4. Finally click the Script button and 'Script Action to New Query Window'

CreatePartitionXMLA_4

Result – the areas I’ve highlighted are the ones that we need to parameterize in SSIS – more to come on that part in a bit.

CreatePartitionXMLA_5

The same way I’ve made a XMLA script to process the partition – the highlighted area is, again, to be parameterized later:

ProcesPartitionXMLA_1

Now we need to go to SSIS and build the logic and steps to achieve the dynamic partitioning.

First of all we need to make sure that all partitions that should be in the model also exist, and if they do not exist, we'll create them.

The SSIS project now needs some variables as listed below:

VariablesList

The scope varies from project to project.

After all the variables have been defined, we need to set up the two XMLA variables as expressions.

In the Expression builder add below codes to the respective variable:

CreatePartitionXMLA – replace the metadata with what matches your solution:

ProcessPartitionXMLA – replace the metadata with what matches your solution:

Now we are all set and just need to build the package.

The focus is now on a container like this:

ContainerCreateAndProcessMissingPartitions

The steps are to get all the missing partitions from our defined view and loop over the result with Create partition, Process partition and Insert data to PartitionLog.

Step 1:

Define a SQL task with the following statement:


Set the output to 'Full result set' and map the result to the variable MissingPartitions.

Add a Foreach Loop Container and define the container to reference the MissingPartitions object as an ADO source, and map the data to the variables as below:

ForeachLoopCreatePartitionParameterMapping

Inside this container add two Analysis Services Execute DDL Tasks and define an Analysis Services Connection that matches the environment.

The first DDL (Create Partition) is defined like this:

CreatePartitionDDL

The second DDL (Process Partition) is defined like this:

ProcessPartitionDDL

The last step is to add a record in the PartitionLog table – add a SQL task with this statement:

And map the parameters like this:

InsertRecord toPartitionLogParameterMapping

Now, when the package runs, it will get a dataset of missing partitions, loop the dataset and create and process the partitions dynamically. At the end it creates a record in the partitionlog to keep track of this.

The last thing we need to do is to add a container to process the latest partition every time the package executes.

We need to build this:

ContainerProcessCurrentPartition

Again, add a SQL task and define it to get data from the view with the current partition that we created earlier:

Map the 'Full result set' to the variable PartitionName.

The foreach loop container must be defined to use this variable and the data mapped like this:

ForeachLoopProcessLatestPartition

Add an Analysis Services Execute DDL task and define it to use the variable ProcessPartitionXMLA. We can reuse it, as it is defined as an expression and uses the same logic.

Finally add a SQL task with below code:

Map the parameters like this:

UpdatePartitionLog

And there it is. All done.

Now every time the package is executed it will ensure that missing partitions are created (in this case for every month's start) and processed.
And it will make sure that the latest partition is updated with the latest data.

Processing now takes very little time, as the only data that is processed is the latest partition. Of course, the first time the package runs it will create all the partitions and process them.

The whole code from above can be downloaded here:

Blog-code_DynamicPartitions

Finally – MCSE Business Intelligence


I got it! The title of MCSE Business Intelligence.

My, for now, last test was 70-467 – Designing Business Intelligence Solutions with Microsoft SQL Server 2012.

I can now call myself MCSE.

The last two tests:

  • 70-466 – Implementing Data Models and Reports with Microsoft SQL Server 2012
  • 70-467 – Designing Business Intelligence Solutions with Microsoft SQL Server 2012

They were not that hard compared to the first 3 to become MCSA.
There is, for now, no training kit to help you on your way. The only way is to know the 'skills measured', find your gaps and fill them with knowledge and practice.

The last test today brought case studies. 3 of them to be exact.

A case study is a set of 'documents' explaining the background for a given scenario – including technical and business requirements, some architecture and a data model/server infrastructure.
It is followed by a set of 5-8 questions – which you can only answer correctly if you know the background of the given scenario.

I learned the hard way not to read the whole text for every case study – but to first read the questions and then find the relevant phrases in the text afterwards.

So today is a happy day – I’m done with the certification for Business Intelligence for now.

Preparations for certification: MCSA – SQL Server 2012

Ready, Steady, GO!!

Today I’ve just passed the last certification on the road to MCSA: SQL Server 2012. GREAT!!
What a journey and what a huge pile of books, blog-links and other stuff.

The usual training kits from Microsoft Press have been read and have been the base for further reading and study.

A few days ago Microsoft launched their YouTube channel with 3 very good videos regarding the path to MCSA: SQL Server 2012.

It brings one video of about 1 hour and 15 minutes for each of the three certifications.

70-461: Querying SQL Server 2012
70-462: Administering SQL Server 2012
70-463: Implementing a Data Warehouse with SQL Server 2012

Bring out the notepad and a cup of your best coffee.

Microsoft's YouTube Channel

SQL 2012: Performance monitoring the light and right way

Ever had that awesome SQL trace built up that does just the right performance monitoring for your system – well, I know I have. And someday you might need the same trace again. But now you need to build it all over again…

Here the Extended Events feature comes into place. It was first introduced in SQL Server 2008.
The feature is a good and lightweight event-driven mechanism for collecting information about your SQL Server. Extended Events have a lighter footprint than the old Trace Events. They also have a more programmatic approach to getting the events and the information they respond to. Extended Events have another cool feature – they are stored inside the SQL Server and can be turned on and off very simply – even on a job-wise level, rather than the old-fashioned way of recreating and rethinking the whole semantic for every trace you need.

Also, Trace Events are not guaranteed to exist in future versions of SQL Server – so now is the time to learn Extended Events, before you are forced to.

But the feature was T-SQL only – for me that was too many arguments and variables to know by heart. The output from the collection was XML and therefore needed more decoding to be used for optimization. So I never got to use it on a daily basis.

And along came Polly, the Extended Events GUI, with SQL 2012.

Finally a GUI for the Extended Events handler that eases up your daily work with SQL Server. A set of GUI interfaces have been introduced for dealing with Extended Events: a Wizard, an Extended Events Properties Editor, and a Data Viewer. The Wizard is a handy way to get walked through creating an Extended Events Session, but I'm going to skip past that and talk about the Properties Editor and the Data Viewer. These two interfaces are where you're going to spend most of your time.

You can find the Extended Events Sessions under the Management folder in SQL Server Management Studio (SSMS). As a replacement (and enhancement, because it does more) for the default trace, a Session comes installed, system_health. You can use this as a great way to learn how Extended Events Sessions are set up because the Session includes many different types of Events, Targets, Filters, and Actions. The same set of windows we’re about to go over can also be used to create new Sessions in addition to edit existing ones.

Defining a Session

For illustration purposes, I’m going to stop the system_health Session while we examine its properties, just so that they’re mostly accessible. All I have to do to make this happen is right-click the session and select Stop Session from the context menu. Right-click the Session again and select Properties, and the Editor opens.

EE_Figure_1

Even though there are a lot of details in the window, most of them are self-explanatory.

It’s a matter of supplying a Session name and, when you are creating a new Session, you can pull from a list of Session Templates.

The nifty feature is that you have full control of when the Session starts. When I start from scratch, I normally start the event session immediately and watch the live data in the Data Viewer window (more to come on this later).

Selecting Events

The next page down is the Events page. Clicking it on the left you’ll see all the power and flexibility of Extended Events on display.

EE_Figure_2

Yup, there are a lot of things and corners here. Let's take them one at a time.
The left side of the screen is where the main work is done – at the top there is a text field and a drop-down selection box. This lets you search the collection of events in the system. For example, in the above screenshot there's no search in place, so all events are displayed. If I typed something in the textbox, the events below the box would be filtered according to my search. This is a great way to quickly find the specific event you might want to use. The drop-down can then help you on the way by selecting e.g. 'Event names only', 'Event Fields only', 'Event names and Description' or 'All'.

In the list itself you can further filter the Extended Events you are interested in. The headings on the columns can be sorted just by clicking on them. In the above picture the Name is sorted (the little triangle to the right of 'Name' indicates the sorting order).
The columns Category and Channel are drop-downs that let you filter the list even further.

The names of the Extended Events can be very cryptic. Therefore you'll find an additional description of the selected Event in the lower left corner of the window. Right next to this, there is a list of the fields for the selected Extended Event. These fields can be compared to the old-fashioned Columns in the Trace; they are somewhat more inherent and unique to each Event. When capturing an Event you also by default capture most of its Fields. More on the exceptions later.

Once you are satisfied with your selection of Events, you click the large right arrow in the center of the screen to move the Event to the Selected Events list. Removal of items from this list is easy-peasy. Again, below this list you'll get a second description of the Extended Event you've selected.

Configuring Events

There’s even more functionality on this page. Notice that button in the top right corner that says Configure and points to the right? Click that and you get to a whole new set of functionality.

EE_Figure_3

Here you configure the Events. If you want to go back – just click the Select button.

On the left side of the window you have a list of Events, and some extra goodies. First the name, and here you can, again, sort as you like. Below it, a description of the selected Event. Right next to the name column there is a column (the one with the lightning) showing how many Global Fields, also known as Actions, have been selected. The last column shows whether the Event is filtered. This is a great way to quickly and easily identify the Filters and Actions you have. In the above picture, you can see that error_reported has a filter, and none of the other Events have.

To the right there are 3 tabs: Global Fields (Actions), Filter (Predicate) and Event Fields.
The first one – an Action, or global field – is an additional column that you can add to any event that you want to capture. An example: the error_reported event that is currently highlighted does not have a database_id Event Field. If you want to capture that field when an error occurs (might be a good idea), you will have to use an Action.

The thing is that an Action is captured after the Event and is executed synchronously. This means that if there is anything that might cause performance bottlenecks as part of your Extended Event capture, here is a likely candidate (among a few others). So instead of calling them Global Fields, which can sound a little too attractive, I prefer the name in parentheses, Actions. This way it is clear that they are different and that you should use them with caution. Selecting a particular Action's check box adds it to the Extended Event selected on the left side of the screen.

EE_Figure_4

As shown above, clicking the Filter (Predicate) tab lets you see and control which events get captured.
Also, obviously, you can add more lines in an easy-access manner. Each of the columns provides you with a drop-down list, except the final one where you're expected to enter information. The Fields are from the event plus a selection of operating system and SQL Server Fields that you can filter on. The comparison operators are the standard set of equals, less than, and so on, divided up into int and int64. The one thing I'd add is that the more restrictive your first filter, the less load these will place on the system. Eliminating all errors below severity 20 as the first criterion is a good example.

The last tab shows the Event Fields, fields that are unique to this event. Don’t misunderstand me: events have lots of fields in common (such as session_id and database_id, because these are common values within SQL Server), but each event has a preselected list of Fields that apply to it.

EE_Figure_5

Mostly this is just a listing of the Fields and their data types for the selected Event. However, note the field at the top of the list with the check box. By default, all Fields are included, except a few that are more expensive to collect. You have to decide whether you need these Fields when you're setting up your Extended Events Session. Selecting the check box will include them in the Session.

Data Storage

And that’s it. You’ve now seen how to select a list of Extended Events and to configure those events with Actions, Filters, and Event Fields. Up to now I haven’t talked about where all this information goes. That’s the next page. Clicking the Data Storage page on the left side of the screen via the page listing there opens up a screen like this:

EE_Figure_6

You can define a number of different Targets for your Extended Events Sessions. It really depends on how you’re trying to consume these events. The interesting thing is, if you just want to watch the Session live, you don’t actually have to designate a target here. I don’t want to try to describe what all these are for and what they can do; there’s better documentation for that.

But for most situations, the likely target will be the one selected, event_file. This puts all the output into a file. When you select this Target, you get several properties that you must define at the bottom of the screen. For example, the file name and location are naturally included. You also get to decide how large you want the files to be, if you want them to rollover as they’re filled, and, if they’re rolling over, what the maximum number of files ought to be. It’s a great way to capture information like query performance metrics so that you can later load them into tables and start running reports to identify the longest running query, for example.

Advanced features

The last page is the Advanced features page:

EE_Figure_7

I'll not go deep into this, except to say that here there is a lot of control over how much impact you'll allow, or force, for your setup of Extended Events. By making adjustments here, you can ensure that you have no loss of Events (and probably a much higher load on your system) or a very lossy process (with a lower impact). Look to Books Online and other resources for when and how to adjust these.

That's it! Done. Click OK, and now you have a new Session (or you've updated an existing one). No T-SQL required. However, it is still possible to script your own Session once it's created, to be used on other servers if applicable.
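For completeness, a scripted Session could look something like this minimal sketch (the session name, file path and duration filter are just examples):

create event session [QueryPerformance] on server
add event sqlserver.sql_batch_completed
(
	action (sqlserver.database_id, sqlserver.sql_text)
	where ([duration] > (1000000))		-- only batches running longer than 1 second (microseconds)
)
add target package0.event_file
(
	set filename = N'C:\Temp\QueryPerformance.xel',
		max_file_size = (50),		-- MB per file
		max_rollover_files = (5)
)
with (max_dispatch_latency = 30 seconds, startup_state = off);

alter event session [QueryPerformance] on server state = start;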

Time to watch SQL-telly

Now I can watch queries as they go by. All I have to do is right-click the Session and select Watch Live Data, and the Data Viewer window opens as in Figure 8. What's more, you can use File, Open to find *.xel files and open them directly in the same viewer.

EE_Figure_8

The window is split in two. At the top are all the events and the timestamp for when they occurred. At the bottom are the Fields for the selected Event in the window above. You can scroll through the various data, and you’ll see everything you need, no XML required.

Scrolling around to find the data you want can be a pain, so, if you like, you can right-click a Field in the lower window and select Show Column In Table from the context menu. Below, a few columns are displayed in the grid.

EE_Figure_9

There’s a bunch more functionality built into the Data Viewer. You can double-click a Field to open in a new window, which is handy for viewing long T-SQL strings or XML output. If you’re looking at either a stopped Session or a file, you can sort the grid by columns. You can’t do that while watching Live data. Best of all, and I really love this, you can toggle a bookmark on an event so that you can find your way back to that event quickly. OK, maybe that isn’t best, but it’s pretty good. You can also apply a filter based on a value in a Field to show only that one. So, for example, if I only wanted to look at the error_reported Events from above example, I could right-click that column where that is the value and select Filter On This Value from the context menu.

ee_topbar

If you’re looking at a Session, not a file, you can start and stop the session, pick which window you want shown, edit your filters, and do grouping and aggregation. Actually that’s pretty slick, too. What I’ve done in Figure 11 is group by the Event name Field so that I’m seeing all of a particular event as a set.

From my small run of events I have three different types: error_reported with 18 Events, rpc_completed with 4, and sql_batch_completed with 41. I've expanded the rpc_completed to show the individual calls. This does bring out one issue with Extended Events that I found to be a little problematic for gathering query metrics. Note that the batch_text Field is NULL for all the rpc_completed events. This is because the rpc_completed Event has a statement Field that is the equivalent of the batch_text field in the sql_batch_completed Event. A slight pain, and something to be aware of. However, you're compensated by being able to get the object_name Field, which means you can immediately group all your stored procedure calls by the name of the procedure, no worries about trying to parse out the T-SQL information to remove parameters. That's a huge win.

A little extra feature is the ability to search the results, as shown below.

EE_Figure_11

You can define what field or fields you want to search through. In my case, I chose Table Columns. You can put different criteria in and even make use of wildcards and regular expressions. I left that off in my case so that I could find the literal string ‘SELECT *’.

There’s still more that I haven’t covered, but you get the idea. With SQL Server 2012 you get the ability to do fully fledged data exploration through your Extended Event output.

Summary of performance monitoring

There’s much more to learn about Extended Events, and you’re probably still going to use a lot of T-SQL code when working with them. But to get started, you now have a GUI that has everything you need to set up, control, and maintain Extended Events. You also have a GUI that allows you to consume and report on the data collected from Extended Events. Many of the reasons people had in the past for not using Extended Events should now be eliminated.
