
Importing flatfiles to a SQL server with a varying number of columns

Have you ever been as frustrated as I am when importing flatfiles to a SQL Server and the format suddenly changes in production?

The most commonly used integration tools (like SSIS) are very dependent on correct and consistent metadata when working with flatfiles.

I’ve come up with a solution that I would like to share with you.

When implemented, the process of importing flatfiles with changing metadata is handled in a structured and, most importantly, flawless way – even if the columns change order or existing columns are missing.

Background

When importing flatfiles to SQL Server, almost every standard integration tool (including TSQL bulk load) requires fixed metadata from the files in order to work with them.

This is quite understandable, as the process of data transportation from the source to the destination needs to know where to map every column from the source to the defined destination.

Let me give an example:

A source flatfile table like below needs to be imported to a SQL server database.

This file could be imported to a SQL Server database (in this example named FlatFileImport) with below script:

create table dbo.personlist (
	[name] varchar(20),
	[gender] varchar(10),
	[age] int,
	[city] varchar(20),
	[country] varchar(20)
);

BULK INSERT dbo.personlist
FROM 'c:\source\personlist.csv'
WITH
(
	FIRSTROW = 2,
	FIELDTERMINATOR = ';',  --CSV field delimiter
	ROWTERMINATOR = '\n',   --Use to shift the control to next row
	TABLOCK,
	CODEPAGE = 'ACP'
);

select * from dbo.personlist;

The result:

If the column ‘Country’ were removed from the file after the import has been set up, the import process would either break or produce wrong results (depending on the tool used to import the file), because the metadata of the file has changed.

-- import data from file with missing column (Country)
truncate table dbo.personlist;
 
BULK INSERT dbo.personlist
FROM 'c:\source\personlistmissingcolumn.csv'
WITH
(
	FIRSTROW = 2,
	FIELDTERMINATOR = ';',  --CSV field delimiter
	ROWTERMINATOR = '\n',   --Use to shift the control to next row
	TABLOCK,
	CODEPAGE = 'ACP'
);
 
select * from dbo.personlist;

With this example, the import seems to go well, but upon browsing the data, you’ll see that only one row is imported and the data is wrong.

The same would happen if the columns ‘Gender’ and ‘Age’ were to switch places. The import might not break, but the mapping of the columns to the destination would be wrong, as the ‘Age’ column would go to the ‘Gender’ column in the destination and vice versa. This is due to the order and datatype of the columns. If the columns had the same datatype and the data fit the columns, the import would go through – but the data would still be wrong.

-- import data from file with switched columns (Age and Gender)
truncate table dbo.personlist;
 
BULK INSERT dbo.personlist
FROM 'c:\source\personlistswitchedcolumns.csv'
WITH
(
	FIRSTROW = 2,
	FIELDTERMINATOR = ';',  --CSV field delimiter
	ROWTERMINATOR = '\n',   --Use to shift the control to next row
	TABLOCK,
	CODEPAGE = 'ACP'
);

When importing the same file, but this time with an extra column (Married) – the result would also be wrong:

-- import data from file with new extra column (Married)
truncate table dbo.personlist;
 
BULK INSERT dbo.personlist
FROM 'c:\source\personlistextracolumn.csv'
WITH
(
	FIRSTROW = 2,
	FIELDTERMINATOR = ';',  --CSV field delimiter
	ROWTERMINATOR = '\n',   --Use to shift the control to next row
	TABLOCK,
	CODEPAGE = 'ACP'
);
 
select * from dbo.personlist; 

The result:

The above examples are made with pure TSQL code. If they were done with an integration tool like SQL Server Integration Services, the errors would be different; the SSIS package would throw errors and not be able to execute the data transfer.

The cure

When using the above BULK INSERT functionality from TSQL, the import process often goes through, but the data is wrong when the source file has changed.

There is another way to import flatfiles. This is using the OPENROWSET functionality from TSQL.

Section E of the example scripts from MSDN describes how to use a format file. A format file is a simple XML file that contains information about the source file's structure – including columns, datatypes, row terminator and collation.

Generating the initial format file for a certain source is rather easy when setting up the import.

But what if the generation of the format file could be done automatically and the import process would be more streamlined and manageable – even if the structure of the source file changes?

From my GitHub project you can download a home brewed .NET console application that solves just that.

If you are unsure of the .EXE file's content and origin, you can download the code and build your own version of the GenerateFormatFile.exe application.
Another note: I'm not a hardcore .NET developer, so someone might have a better way of doing this. You are very welcome to contribute to the GitHub project in that case.

The application requires the inputs below:

Example usage:

generateformatfile.exe -p c:\source\ -f personlist.csv -o personlistformatfile.xml -d ;

The above command generates a format file in the directory c:\source\ and names it personlistformatfile.xml.

The content of the format file is as follows:
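
The generated file is not reproduced here, but a bcp XML format file for the personlist.csv columns could look roughly like the sketch below (field lengths and SQL types are assumptions – the file produced by the tool may differ in detail):

<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <RECORD>
    <FIELD ID="1" xsi:type="CharTerm" TERMINATOR=";" MAX_LENGTH="20"/>
    <FIELD ID="2" xsi:type="CharTerm" TERMINATOR=";" MAX_LENGTH="10"/>
    <FIELD ID="3" xsi:type="CharTerm" TERMINATOR=";" MAX_LENGTH="12"/>
    <FIELD ID="4" xsi:type="CharTerm" TERMINATOR=";" MAX_LENGTH="20"/>
    <FIELD ID="5" xsi:type="CharTerm" TERMINATOR="\r\n" MAX_LENGTH="20"/>
  </RECORD>
  <ROW>
    <COLUMN SOURCE="1" NAME="name" xsi:type="SQLVARYCHAR"/>
    <COLUMN SOURCE="2" NAME="gender" xsi:type="SQLVARYCHAR"/>
    <COLUMN SOURCE="3" NAME="age" xsi:type="SQLINT"/>
    <COLUMN SOURCE="4" NAME="city" xsi:type="SQLVARYCHAR"/>
    <COLUMN SOURCE="5" NAME="country" xsi:type="SQLVARYCHAR"/>
  </ROW>
</BCPFORMAT>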

The console application can also be called from TSQL like this:

-- generate format file
declare @cmdshell varchar(8000);
set @cmdshell = 'c:\source\generateformatfile.exe -p c:\source\ -f personlist.csv -o personlistformatfile.xml -d ;'
exec xp_cmdshell @cmdshell;

If by any chance the xp_cmdshell feature is not enabled on your local machine – then please refer to this post from Microsoft: Enable xp_cmdshell

Using the format file

After generation of the format file, it can be used in TSQL script with OPENROWSET.

Example script for importing the ‘personlist.csv’

-- import file using format file
select *  
into dbo.personlist_bulk
from  openrowset(
	bulk 'c:\source\personlist.csv',  
	formatfile='c:\source\personlistformatfile.xml',
	firstrow=2
	) as t;
 
select * from dbo.personlist_bulk;

This loads the data from the source file to a new table called ‘personlist_bulk’.

From here the load from ‘personlist_bulk’ to ‘personlist’ is straightforward:

-- load data from personlist_bulk to personlist
truncate table dbo.personlist;
 
insert into dbo.personlist (name, gender, age, city, country)
select * from dbo.personlist_bulk;
 
select * from dbo.personlist;
 
drop table dbo.personlist_bulk;

Load data even if source changes

The above approach works if the source is the same every time it loads. But with a dynamic approach to the load from the bulk table to the destination table, it can be ensured that the load works even if the source table changes in both width (number of columns) and column order.

For some the script might seem cryptic – but it is only a matter of generating a list of column names from the source table that corresponds with the column names in the destination table.

-- import file with different structure
-- generate format file
if OBJECT_ID('dbo.personlist_bulk') is not null drop table dbo.personlist_bulk;
 
declare @cmdshell varchar(8000);
set @cmdshell = 'c:\source\generateformatfile.exe -p c:\source\ -f personlistmissingcolumn.csv -o personlistmissingcolumnformatfile.xml -d ;'
exec xp_cmdshell @cmdshell;
 
 
-- import file using format file
select *  
into dbo.personlist_bulk
from  openrowset(
	bulk 'c:\source\personlistmissingcolumn.csv',  
	formatfile='c:\source\personlistmissingcolumnformatfile.xml',
	firstrow=2
	) as t;
 
-- dynamic load data from bulk to destination
declare @fieldlist varchar(8000);
declare @sql nvarchar(4000);
 
select @fieldlist = 
				stuff((select 
					',' + QUOTENAME(r.column_name)
						from (
							select column_name from INFORMATION_SCHEMA.COLUMNS where TABLE_NAME = 'personlist'
							) r
							join (
								select column_name from INFORMATION_SCHEMA.COLUMNS where TABLE_NAME = 'personlist_bulk'
								) b
								on b.COLUMN_NAME = r.COLUMN_NAME
						for xml path('')),1,1,'');
 
print (@fieldlist);
set @sql = 'truncate table dbo.personlist;' + CHAR(10);
set @sql = @sql + 'insert into dbo.personlist (' + @fieldlist + ')' + CHAR(10);
set @sql = @sql + 'select ' + @fieldlist + ' from dbo.personlist_bulk;';
print (@sql)
exec sp_executesql @sql
 

The result is a TSQL statement that looks like this:

truncate table dbo.personlist;
insert into dbo.personlist ([age],[city],[gender],[name])
select [age],[city],[gender],[name] from dbo.personlist_bulk;

The exact same approach can be used with the other source files in this demo. The result is that the destination table is loaded with the right data every time – and only with the data that corresponds with the source. No errors will be thrown.

From here there are some remarks to take into account:

  1. As no errors are thrown, the source files could be empty, leaving the destination table blank after the load. This is to be handled by processes outside this demo.

Further work

As this demo and post show, it is possible to handle dynamically changing flat source files. Changing columns, column order and other changes can be handled in an easy way with a few lines of code.

Going from here, a suggestion could be to set up processes that compare the two tables (bulk and destination) and throw an error if a certain number of columns are missing from the bulk table or a certain number of columns are new.

It is also possible to auto-generate missing columns in the destination table based on the columns from the bulk table, as sketched below.
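
A minimal sketch of how that could look – here new columns are simply added as nullable varchar(100), which is an assumption; adjust the datatype handling to your own needs:

-- sketch: add columns that exist in the bulk table but not in the destination
declare @alter nvarchar(4000);

select @alter =
	stuff((select
		'; alter table dbo.personlist add ' + QUOTENAME(b.column_name) + ' varchar(100) null'
		from INFORMATION_SCHEMA.COLUMNS b
		where b.TABLE_NAME = 'personlist_bulk'
		and not exists (
			select 1 from INFORMATION_SCHEMA.COLUMNS r
			where r.TABLE_NAME = 'personlist'
			and r.COLUMN_NAME = b.COLUMN_NAME)
		for xml path('')),1,2,'');

if @alter is not null exec sp_executesql @alter;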

Only your imagination sets the boundaries here.

Summary – importing flatfiles to a SQL server

With this blogpost I hope to have given you inspiration to build your own import structure of flatfiles in those cases where the structure might change.

As seen above, the approach needs some .NET skills – but once the console application has been built, it is a matter of reusing the same application across the different integration solutions in your environment.

Happy coding 🙂

External links:

BULK INSERT from MSDN: https://msdn.microsoft.com/en-us/library/ms188365.aspx

OPENROWSET from MSDN: https://msdn.microsoft.com/en-us/library/ms190312(v=sql.130).aspx

XP_CMDSHELL from MSDN: https://msdn.microsoft.com/en-us/library/ms175046.aspx

GitHub link: https://github.com/brianbonk/GenerateFormatFile/releases/tag/v2.0

Undelete object from database


Have you ever deleted an object from a database by mistake or due to some other error? You can undelete the object – sometimes.

Then you should read on in this short post.

I recently came across a good co-worker of mine who lost one of the views on the developer database. He called me for help.

Fortunately the database was in FULL RECOVERY mode – so I could extract the object from the database log and send the script to him for his further work that day. I think I saved him a whole day of work…

The undelete object script

Here is the script I used:

select 
	convert(varchar(max),substring([RowLog Contents 0], 33, LEN([RowLog Contents 0]))) as [Script]
from 
	fn_dblog(NULL,NULL)
where 1=1
	and [Operation]='LOP_DELETE_ROWS' 
	and [Context]='LCX_MARK_AS_GHOST'
	and [AllocUnitName]='sys.sysobjvalues.clst';

Ready, SET, go – how does SQL Server handle recursive CTEs?

This blogpost will cover some of the basics of recursive CTEs and explain the approach taken by the SQL Server engine.

First of all, a quick recap on what a recursive query is.

Recursive queries are useful for building hierarchies, traversing datasets, generating arbitrary rowsets etc. The recursive part (simply put) means joining a rowset with itself an arbitrary number of times.

A recursive query is defined by an anchor set (the base rowset of the recursion) and a recursive part (the operation that should be done over the previous rowset).

The basics in recursive CTE

A recursive query helps in a lot of scenarios. For instance, where a dataset is built as a parent-child relationship and the requirement is to “unfold” this dataset and show the hierarchy in a ragged format.

A recursive CTE has a defined syntax – and can be written in general terms like this (don't run away because of the general syntax – a lot of examples in real code will come):

select result_from_previous.*
 from result_from_previous
 union all
 select result_from_current.*
 from set_operation(result_from_previous, mytable) as result_from_current

Or rewritten in another way:

select result_from_previous.*
 from result_from_previous
 union all
 select result_from_current.*
 from result_from_previous
 join mytable
 on condition(result_from_previous)

Another way to write the query (using cross apply):

select result_from_current.*
from result_from_previous
cross apply (
select result_from_previous.*
union all
select *
from mytable
where condition(result_from_previous.*)
) as result_from_current

The last one – with the cross apply – is row based and a lot slower than the other two. It iterates over every row from the previous result and computes the scalar condition (which returns true or false). Each row in mytable is then compared to the current row of result_from_previous. When these conditions hold, the query can be rewritten as a join – which is why you should not use cross apply for recursive queries.

The reverse – from join to cross apply – is not always true. To know this, we need to look at the algebra of distributivity.

Distributivity algebra

Most of us have already learned that the mathematics below is true:

X x (Y + Z) = (X x Y) + (X x Z)

But below is not always true:

X ^ (Y x Z) = (X ^ Z) x (X ^ Y)

Or said in words, distributivity means that the order of the operations is not important: the multiplication can be done after the addition or the addition after the multiplication – the result will be the same.

This arithmetic can be carried over to relational algebra – it's pretty straightforward:

set_operation(A union all B, C) = set_operation(A, C) union all set_operation(B, C)

The equality above holds, just like the first equality in the arithmetic.

So the union all over the operations is the same as the operations over the union all. This also implies that you cannot use operators like top, distinct or outer join (there are more exceptions). The result of top over union all is not the same as union all over top. Microsoft has done a lot of good thinking in the recursive approach to reach one ultimate goal – to forbid operators that do not distribute over union all.

With this information and knowledge our baseline for building a recursive CTE is now in place.

The first recursive query

Based on the intro and the above algebra we can now begin to build our first recursive CTE.

Consider a sample rowset (sampletree):

id  parentId  name
1   NULL      Ditlev
2   NULL      Claus
3   1         Jane
4   2         John
5   3         Brian

From above we can see that Brian refers to Jane who refers to Ditlev. And John refers to Claus. This is fairly easy to read from this rowset – but what if the hierarchy is more complex and unreadable?

A sample requirement could be to “unfold” the hierarchy in a ragged hierarchy so it is directly readable.

The anchor

We start with the anchor set (Ditlev and Claus). In this dataset the anchor is defined by parentId is null.

This gives us an anchor-query like below:

recursive CTE 1
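
The anchor query is shown as a screenshot above; a minimal sketch of what it could look like, assuming the sampletree table from before (the varchar length is an assumption):

select
	s.id,
	s.parentId,
	s.name,
	cast('' as varchar(50)) as parentName   -- blank field, filled in by the recursive part later
from dbo.sampletree s
where s.parentId is null;   -- the anchor: Ditlev and Claus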

Now on to the next part.

The recursive

After the anchor part, we are ready to build the recursive part of the query.

The recursive part is actually the same query with small differences. The main select is the same as in the anchor part, but we need to add a self-join to the CTE in the recursive part.

Before we dive more into the total statement – I’ll show the statement below. Then I’ll run through the details.

recursive CTE 2

Back to the self-reference. Notice the two red underlines in the code. The top one indicates the CTE's name and the second one indicates the self-reference. The CTE is joined directly in the recursive part in order to do the logic in the statement. The join is done between the recursive result's parentId and the id in the anchor result. This gives us the possibility to get the name column from the anchor statement.

Notice that I’ve also put in another blank field in the anchor statement and added the parentName field in the recursive statement. This gives us the “human readable” output where I can find the hierarchy directly by reading from left to right.
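
Since the statement above is only shown as a screenshot, here is a minimal sketch of how the complete recursive CTE could be written (table and column names as in the sampletree example; the varchar lengths are assumptions):

with tree as (
	-- anchor part: the top level members (parentId is null)
	select
		s.id,
		s.parentId,
		s.name,
		cast('' as varchar(50)) as parentName
	from dbo.sampletree s
	where s.parentId is null

	union all

	-- recursive part: self-reference to the CTE name (tree),
	-- joined on the recursive result's parentId = the previous result's id
	select
		s.id,
		s.parentId,
		s.name,
		cast(t.name as varchar(50)) as parentName
	from dbo.sampletree s
	join tree t
		on s.parentId = t.id
)
select * from tree;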

To get data from the above CTE I just have to make a select statement from this:

recursive CTE 3

And the results:

recursive CTE 4

I can now directly read that Jane refers to Ditlev and Brian refers to Jane.

But how is this done when the SQL engine executes the query – the next part tries to explain that.

The SQL engine's handling

Given the full CTE statement above I’ll try to explain what the SQL engine does to handle this.

The documented semantics is as follows:

  1. Split the CTE into anchor and recursive parts
  2. Run the anchor member creating the first base result set (T0)
  3. Run the recursive member with Ti as an input and Ti+1 as an output
  4. Repeat step 3 until an empty result set is returned
  5. Return the result set. This is a union all set of T0 to Tn

So let me try to rewrite the above query to match this sequence.

The anchor statement we already know:

recursive CTE 5

First recursive query:

recursive CTE 6

Second recursive query:

recursive CTE 7

The n recursive query:

The union all statement:

This gives us exactly the same result as we saw before, now with the rewrite:

Notice that the statement named Tn above is actually empty. This demonstrates the empty result set that makes the SQL engine stop its execution of the recursive CTE.

This is how I would describe the SQL engine's handling of a recursive CTE.

Based on this very simple example, I guess you already can think of ways to use this in your projects and daily tasks.

But what about the performance and execution plan?

Performance

The execution plan for the original recursive CTE looks like this:

The top part of this execution plan is the anchor statement and the bottom part is the recursive statement.

Notice that I haven’t made any indexes in the table, so we are reading on heaps here.

But what if the data is more complex in structure and depth? Let's base the answer on an example:

In the attached SQL code you'll find a script to generate 20,000+ rows in a new table called complextree. This data is from a live solution and contains medical procedure names in a hierarchy. The data is used to show the relationships between medical procedures done by the Danish hospital system. It is both deep and complex in structure. (Sorry for the Danish letters in the data…)

When we run a recursive CTE on this data, we get exactly the same execution plan:

This is also what I would expect, as the amount of data very seldom impacts the generated execution plan when reading from heaps.

The query runs for 25 seconds on my PC.

Now let me put an index on the table and see the performance and execution plan.

The index is only put on the parentDwId column as, according to what we have learned in this article, this is the recursive part's join column.
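
A sketch of the index – the table and column name (complextree/parentDwId) are from the text; any included columns would depend on the actual query:

create nonclustered index IX_complextree_parentDwId
	on dbo.complextree (parentDwId);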

The query now runs to completion in 1 second and generates this execution plan:

The top line is still the anchor and the bottom part is the recursive part. Notice that the SQL engine now uses the non-clustered index to perform the execution, and the performance gain is noticeable.

Conclusion

I hope that you’ve now become more familiar with the recursive CTE statement and are willing to try it on your own projects and tasks.

The basics are somewhat straightforward – but beware that the query can become complex and hard to debug as the demands on data and output grow. But don't be scared. As I always say – “Don't do a complex query all at once, start small and build it up as you go along”.

Happy coding.

External links:

The with operator in T-SQL: https://technet.microsoft.com/en-us/library/ms175972.aspx

Recursive CTE’s from MSDN: https://msdn.microsoft.com/en-us/library/ms186243.aspx

Wikipedia on distributivity: https://en.wikipedia.org/wiki/Distributive_property

Use of hierarchyid in SQL Server

I attended a TDWI conference in May 2016 in Chicago. Here I got a hint about the datatype hierarchyid in SQL Server, which can optimize, or even eliminate, the good old parent/child hierarchy.

Until then I (and several others in the class) hadn't heard about the hierarchyid datatype in SQL Server. So I had to find out and learn about it.

Here’s a blogpost covering some of the aspects of the datatype hierarchyid – including:

  • Introduction
  • How to use it
  • How to optimize data in the table
  • How to work with data in the hierarchy-structure
  • Goodies

Introduction

The datatype hierarchyid was introduced in SQL Server 2008. It is a variable-length system datatype. The datatype can be used to represent a given element's position in a hierarchy – e.g. an employee's position within an organization.

The datatype is extremely compact. The storage depends on the average fanout (fanout = the number of children of a node). For smaller fanouts (0-7) the typical storage is about 6 * logA(n) bits, where A is the average fanout and n is the total number of nodes in the tree. Given the above formula, a node in an organization with 100,000 employees and an average fanout of 6 will take around 38 bits – rounded up to 5 bytes of storage for the hierarchy structure.

Though the datatype has a limit of 892 bytes, there is a lot of room for extremely complex and deep structures.

When converting values to and from the hierarchyid datatype, the string syntax is:

[level id 1]/[level id 2]/..[level id n]

Example:

1/7/3

The data between the ‘/’ characters can be of decimal types, e.g. 0.1, 2.3 etc.

Given two specific nodes in the hierarchy, a and b, a < b means that b comes after a in a depth-first traversal of the tree structure. Any search and comparison on the tree is done this way by the SQL engine.

The datatype directly supports deletions and inserts through the GetDescendant method (see later for the full list of methods using this feature). This method enables generation of siblings to the right or to the left of any given node – even between two siblings. NOTE: inserting a new node between two siblings produces values that are slightly less compact.

Hierarchyid in SQL Server – how to use it

Given an example of data – see the complete SQL script at the end of this post to generate the example used here.

hierarchyid in SQL Server 1

The Num field is a simple ascending counter for each level member in the hierarchy.

There are some basic methods to be used in order to build the hierarchy using the hierarchyid datatype.

GetRoot method

The GetRoot method gives the hierarchyid of the root node in the hierarchy – represented by EmployeeId 1 in the above example.

The code and result could look like this:

hierarchyid in SQL Server 2

The value ‘0x’ from the OrgPath field is the representation of the string ‘/’ giving the root of the hierarchy. This can be seen using a simple cast to varchar statement:

hierarchyid in SQL Server 3

Building the new structure with the hierarchyid datatype using a recursive SQL statement:

hierarchyid in SQL Server 4

Notice the building of the path after the union all. This complies with the above-mentioned syntax for building the hierarchy path string that is converted to a hierarchyid datatype.
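
The statement itself is shown as a screenshot; a minimal sketch of the pattern, assuming an EmployeeOrg table with the columns EmployeeId, ManagerId, Num and OrgPath (the table name is an assumption):

with paths (path, EmployeeId) as (
	-- anchor: the root of the organization gets the root hierarchyid ('/')
	select hierarchyid::GetRoot() as path, EmployeeId
	from dbo.EmployeeOrg
	where ManagerId is null

	union all

	-- recursive: append '<Num>/' to the parent's path and cast it to hierarchyid
	select cast(p.path.ToString() + cast(e.Num as varchar(30)) + '/' as hierarchyid), e.EmployeeId
	from dbo.EmployeeOrg e
	join paths p
		on e.ManagerId = p.EmployeeId
)
update eo
set OrgPath = p.path
from dbo.EmployeeOrg eo
join paths p
	on eo.EmployeeId = p.EmployeeId;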

If I were to build the path for EmployeeId 10 (Name = ‘Mads’) in the above example, it would look like this: ‘/2/2/’. A select statement converting the hierarchyid field OrgPath for the same record reveals the same thing:

hierarchyid in SQL Server 5

Notice the use of the ToString method here – another built-in method to use with the hierarchyid in SQL Server.

GetLevel method

The GetLevel method returns the current node's level, with an index of 0 at the top:

hierarchyid in SQL Server 6
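
A small sketch of the call (same assumed EmployeeOrg table as above):

select
	Name,
	OrgPath.ToString() as OrgPathString,
	OrgPath.GetLevel() as OrgLevel      -- 0 for the root, 1 for direct reports, etc.
from dbo.EmployeeOrg;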

GetDescendant method

This method returns a new hierarchyid based on the two parameters child1 and child2.

The use of these parameters is described in the BOL HERE.

Below are some short examples of the usage.

Getting a new hierarchyid when a new employee reporting to the top manager is hired:

hierarchyid in SQL Server 7

Getting a new hierarchyid when a new hire reports to Jane in the hierarchy:

hierarchyid in SQL Server 8

Dynamically inserting new records in the hierarchy table – this can easily be converted into a stored procedure:

hierarchyid in SQL Server 9

Notice the GetAncestor method, which takes one parameter (the number of steps up the hierarchy) and returns that level's hierarchyid. In this case just 1 step up the hierarchy.
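
A sketch of the pattern for such a dynamic insert (again using the assumed EmployeeOrg table; the new EmployeeId and name are just examples):

declare @manager hierarchyid, @lastChild hierarchyid;

-- the node the new employee reports to (e.g. Jane)
select @manager = OrgPath
from dbo.EmployeeOrg
where EmployeeId = 3;

-- the rightmost existing child of that node (NULL if there are no children yet)
select @lastChild = max(OrgPath)
from dbo.EmployeeOrg
where OrgPath.GetAncestor(1) = @manager;

-- GetDescendant places the new node after the last child (or as the first child)
insert into dbo.EmployeeOrg (EmployeeId, ManagerId, Name, OrgPath)
values (11, 3, 'NewHire', @manager.GetDescendant(@lastChild, NULL));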

More methods

There are several more methods to use when working on a hierarchy table – as found on BOL:

GetDescendant – returns a new child node of a given parent. Takes two parameters.

GetLevel – returns the given level for a node (0 index)

GetRoot – returns a root member

ToString – converts a hierarchyid datatype to readable string

IsDescendantOf – returns a boolean telling if a given node is a descendant of a given parent

Parse – converts a string to a hierarchyid

Read – is used implicitly by the ToString method. Cannot be called from T-SQL

GetReparentedValue – returns a node reparented from an old root to a new root, used when moving a given node

Write – outputs a binary representation of the hierarchyid. Cannot be called from T-SQL.

Optimization

As in many other scenarios in SQL Server, the usual approach to indexing and optimization can be used.

To help the usual and most used queries, I would create the two indexes below on the example table:

hierarchyid in SQL Server 10
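
The screenshot is not reproduced here, but the two usual index types for a hierarchyid column are a depth-first and a breadth-first index – a sketch on the assumed EmployeeOrg table could be:

-- depth-first index: keeps a node and its subtree close together
create unique index IX_EmployeeOrg_DepthFirst
	on dbo.EmployeeOrg (OrgPath);

-- breadth-first index: keeps all direct children of a node together
-- (uses a persisted computed column for the level)
alter table dbo.EmployeeOrg
	add OrgLevel as OrgPath.GetLevel() persisted;

create index IX_EmployeeOrg_BreadthFirst
	on dbo.EmployeeOrg (OrgLevel, OrgPath);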

But as with any other indexing strategy – base it on the given scenario and usage.

Goodies

So why use this feature and all the coding work that comes with it?

Well – from my perspective – it has just become very easy to quickly get all elements either up or down from a given node in the hierarchy.

Get all descendants from a specific node

If I want to get all elements below Jane in the hierarchy, I just have to run this command:

hierarchyid in SQL Server 11
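
A sketch of such a query, using the IsDescendantOf method (assumed table and names as before):

declare @janeNode hierarchyid;

select @janeNode = OrgPath
from dbo.EmployeeOrg
where Name = 'Jane';

-- IsDescendantOf also returns 1 for the node itself
select Name, OrgPath.ToString() as OrgPathString
from dbo.EmployeeOrg
where OrgPath.IsDescendantOf(@janeNode) = 1;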

Think of the work you would have to do if this were a table structured only as parent/child, using recursive SQL, and the structure was very complex and deep.

I know what I would choose.

Conclusion

As seen above, the datatype hierarchyid can be used to give order to the structure of a hierarchy in a way that is both efficient and fairly easily maintained.

If one wanted to optimize the structure even further, the EmployeeId and ManagerId columns could be dropped, as the OrgPath is now just as distinct as the EmployeeId and can replace it. The ManagerId is only used to build the structure – and this is now also given by the OrgPath.

Happy coding…

External references:

Hierarchyid from MSDN

Using hierarchyid from TechNet

Update SCD Type 2 dimension in one single transaction using only T-SQL


Recently I got a request inside my organization to make sure that an SCD Type 2 dimension would keep track of changes, due to requirements from the business.

This needed to be done in a single transaction in pure T-SQL code.

So – what to do and how to do it. Here’s one way.

The source table looks like this:

SCD type 2 dimension

The request was to keep track of changes in the ManagerId according to CaseId.

I’ve created a SCD2 table like this:

CREATE TABLE [dbo].[CaseProjectManagerHistory](
	[dwid] [bigint] IDENTITY(1,1) NOT NULL,
	[CaseId] [int] NULL,
	[ManagerId] [int] NULL,
	[dwDateFrom] [date] NULL,
	[dwDateTo] [date] NULL,
	[dwIsCurrent] [bit] NULL,
	[dwChangeDate] [date] NULL
)


The fields are as follows:
dwid: Identifier for the table
CaseId: The caseid for the rows
ManagerId: The managerid for the row
dwDateFrom: The date from where the row is actual
dwDateTo: The date to where the row is actual
dwIsCurrent: Boolean that tells if the row is the current one or not
dwChangeDate: The date of the change (if the row has changed since the first write)

If you need to catch up on the history types in a dimension – then take a look at Kennie’s blogpost HERE.

First of all I started out with a merge statement that would insert all the new values not already in the table and update the ones that needed updating.

Something like this:

merge dbo.CaseProjectManagerHistory as target
	using (select CaseId, ManagerId, cast(getdate() as date) as startDate, datefromparts(2199,1,1) as endDate, 1 as [current], cast(getdate() as date) as changeDate from dbo.[Case]) as source
	on target.CaseId = source.CaseId
	when not matched by target
		then
			insert (CaseId, ManagerId, dwDateFrom, dwDateTo, dwIsCurrent, dwChangeDate)
			values (source.CaseId, source.ManagerId, source.startDate, source.endDate, source.[current], source.changeDate)
	when matched 
		and target.dwIsCurrent = 1
		and exists (select source.CaseId, source.ManagerId
					except
					select target.CaseId, target.ManagerId)
		and target.dwChangeDate <= source.ChangeDate
		and source.changeDate < target.dwDateTo
		then
			update set dwIsCurrent = 0, target.dwChangeDate = source.changeDate, target.dwDateTo = dateadd(d,-1,source.startDate)


Those of you who haven’t tried and worked with a merge-statement – you can get the 101 from BOL here.

But this merge statement only inserts new rows and updates existing rows. For the rows that are updated, a new current version still needs to be inserted into the table in order to fully comply with the SCD Type 2 rules.

This can be done by using the ‘output’ clause of the merge statement and then using the output rows to insert into the same table.

It will look like this:

insert into dbo.CaseProjectManagerHistory_demo (CaseId, ManagerId, dwDateFrom, dwDateTo, dwIsCurrent, dwChangeDate)
select CaseId, ManagerId, startDate, endDate, [current], changeDate 
from (
	merge dbo.CaseProjectManagerHistory_demo as target
	using (
		select 
			CaseId
			,ManagerId
			,cast(getdate() as date) as startDate
			,datefromparts(2199,1,1) as endDate
			,1 as [current]
			,cast(getdate() as date) as changeDate 
		from 
			dbo.[Case]
		where 1=1
			and caseid in (2005,2013,2015,2016,2019,2021,2023,2025,2027,2028)
			) as source
	on target.CaseId = source.CaseId
	when not matched by target -- insert new rows
		then
			insert (CaseId, ManagerId, dwDateFrom, dwDateTo, dwIsCurrent, dwChangeDate)
			values (source.CaseId, source.ManagerId, source.startDate, source.endDate, source.[current], source.changeDate)
	when matched -- update existing rows
		and target.dwIsCurrent = 1
		and exists (select source.CaseId, source.ManagerId -- only rows that do not already exist in the target
					except
					select target.CaseId, target.ManagerId)
		and target.dwChangeDate <= source.ChangeDate
		and source.changeDate < target.dwDateTo
		then
		update set dwIsCurrent = 0, target.dwChangeDate = source.changeDate, target.dwDateTo = dateadd(d,-1,source.startDate) 
		output $action ActionOut, source.CaseId, source.ManagerId, source.startDate, source.endDate, source.changeDate, source.[current]) as mergeOutput
where mergeOutput.ActionOut = 'UPDATE';

The merge statement's ‘output’ action is used to insert the same rows into the history table once more. The only change is the ‘end date’.

Happy coding!

Note: I did a short presentation with this at my workplace a few weeks ago, and here Kennie (l, b, t) told me that there is a bug in the merge statement that needs to be taken into account. Read more of that here.

Query store – next generation tool for every DBA

Along with the release of SQL Server 2016 CTP 3 now comes the preview of a brand new feature for on-premise databases – the Query Store. This feature enables performance monitoring and troubleshooting through a log of executed queries.

This blogpost will cover the aspects of this new feature including:

  • Introduction
  • How to activate it
  • Configuration options
  • What information is found in the Query Store
  • How to use the feature
  • What’s in it for me

Introduction

The new Query Store feature gives everyone with responsibility for SQL Server performance and troubleshooting insight into the actual queries and their query plans. It replaces the old way of setting up tracing, logging and event handling with a standard, out-of-the-box feature.

It enables you to find causes of performance differences due to changes in query plans. It also captures historical data for queries, plans and runtime statistics, and keeps these for later review. The storage is divided into configured time slots.

All in all, this feature enables you to monitor, capture and analyze performance issues in the server with a few standard settings.

How to activate it

The feature can be enabled in two ways – from SSMS with mouse-clicks or from T-SQL statements.

Enable Query Store from Management Studio

From the object explorer window, right click the database and select Properties.

Here click the Query Store page and change the ‘Enable’ to TRUE:

query store

Enable Query Store from T-SQL statement

In a new query window the following statement enables Query Store on the database ‘QueryStoreDB’:

ALTER DATABASE QueryStoreDB SET QUERY_STORE = ON;

Configuration options

The Query Store has a series of configuration options. All of them can be set from the SQL Server Management Studio with clicks or with T-SQL Statements.

OPERATION_MODE – This can be READ_WRITE or READ_ONLY and states if the Query Store is to collect new data (READ_WRITE) or not to collect data and just hold current data (READ_ONLY).

CLEANUP_POLICY – Specifies through the STALE_QUERY_THRESHOLD_DAYS the number of days for the query store to retain data.

DATA_FLUSH_INTERVAL_SECONDS – Gives the interval at which data written to the Query Store is persisted to disk. The transfer happens asynchronously at this frequency.

MAX_STORAGE_SIZE_MB – This gives the maximum size of the total data in the Query Store. If and when the limit is reached, the OPERATION_MODE is automatically changed to READ_ONLY and no more data is collected.

INTERVAL_LENGTH_MINUTES – Gives the interval at which the data from runtime execution stats is aggregated. The option gives the fixed time window for this aggregation.

SIZE_BASED_CLEANUP_MODE – When the data in the Query Store gets close to the configured number in MAX_STORAGE_SIZE_MB this option can control the automatic cleanup process.

QUERY_CAPTURE_MODE – Gives the Query Store option to capture all queries or relevant queries based on execution count and resource usage.

MAX_PLANS_PER_QUERY – The maximum number of execution plans maintained for queries.

From SQL Server Management Studio the window looks like below when the Query Store is enabled. At the bottom of this window, you can also see the current disk usage.

The T-SQL syntax for setting the Query Store options is as follows:

ALTER DATABASE <database name> 
SET QUERY_STORE (
    OPERATION_MODE = READ_WRITE,
    CLEANUP_POLICY = 
    (STALE_QUERY_THRESHOLD_DAYS = 30),
    DATA_FLUSH_INTERVAL_SECONDS = 3000,
    MAX_STORAGE_SIZE_MB = 500,
    INTERVAL_LENGTH_MINUTES = 15,
    SIZE_BASED_CLEANUP_MODE = AUTO,
    QUERY_CAPTURE_MODE = AUTO,
    MAX_PLANS_PER_QUERY = 1000
);

What information can be found in the Query Store

Specific queries in SQL Server normally have evolving execution plans over time, due to e.g. schema changes, changes in statistics, indexes etc. Also, the plan cache evicts execution plans under memory pressure. The result is that query performance troubleshooting can be non-trivial and time consuming.

The Query Store retains multiple execution plans per query. Therefore it can be used to enforce a certain execution plan for a specific query. This is called plan forcing (see below for the stored procedure to do this).

Prior to SQL Server 2016 the hint ‘USE PLAN’ was used, but now it is a fairly easy task to enforce a specific execution plan on the query processor.
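
A small example of forcing (and un-forcing) a plan – the ids are placeholders, look them up in the Query Store views or reports first:

-- force plan 210 for query 48
EXEC sp_query_store_force_plan @query_id = 48, @plan_id = 210;

-- remove the forcing again
EXEC sp_query_store_unforce_plan @query_id = 48, @plan_id = 210;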

More scenarios for using the Query Store:

  • Find and fix queries that have a regression in performance due to plan changes
  • Overview of how often and in which context a query has been executed, helping the DBA on performance tuning tasks
  • Overview of the historic plan changes for a given query
  • Identify top n queries (by time, CPU time, IO etc.) in the past x hours
  • Analyze the use of resources (IO, CPU and memory)

The Query Store contains two stores – a plan store and a runtime stats store. The Plan Store persists the execution plan information and the Runtime Stats Store persists the execution statistics information. Information is written to the two stores asynchronously to optimize performance.

The space used to hold the runtime execution information can grow over time, so the data is aggregated over a fixed time window as per setting made in the configuration.

When Query Store is enabled in the database a set of system views will be ready for queries.

sys.database_query_store_options

sys.query_context_settings

sys.query_store_query

sys.query_store_query_text

sys.query_store_plan

sys.query_store_runtime_stats

sys.query_store_runtime_stats_interval

Furthermore a series of system stored procedures can be called:

sp_query_store_flush_db

sp_query_store_reset_exec_stats

sp_query_store_force_plan

sp_query_store_unforce_plan

sp_query_store_remove_plan

sp_query_store_remove_query

How to use Query Store

The Query Store comes with 4 standard reports as shown below:

All standard reports can be modified in several ways to fit your personal needs. This is done via drop-down selections and point-and-click.

The Regressed Queries report gives an overview of the top 25 most resource-consuming queries in the last hour, including the execution plan, a timetable showing when the query ran and how long it took, etc.

The Overall Resource Consumption report shows 4 charts as standard, based on duration, execution count, CPU time and logical reads.

The Top Resource Consuming Queries report uses the same format as Regressed Queries, only non-aggregated and with more details.

The Tracked Queries report shows detailed data for a selected query – here you need to find and remember the query id, which can be found, among other ways, with the queries below against the Query Store system views.

The data from the Query Store can be accessed from the above described system views. Examples of usage can be found below.

Top 5 queries with the longest average execution time the last hour

SELECT TOP 5
   rs.avg_duration
   ,qt.query_sql_text
   ,rs.last_execution_time 
FROM 
   sys.query_store_query_text AS qt 
   RIGHT JOIN sys.query_store_query AS q 
      ON qt.query_text_id = q.query_text_id 
   RIGHT JOIN sys.query_store_plan AS p 
      ON q.query_id = p.query_id 
   RIGHT JOIN sys.query_store_runtime_stats AS rs 
      ON p.plan_id = rs.plan_id
WHERE  1=1 
   AND rs.last_execution_time > DATEADD(hour, -1, GETUTCDATE())
ORDER BY 
   rs.avg_duration DESC;

Last 10 queries executed on the server

SELECT TOP 10 qt.query_sql_text, q.query_id, 
    qt.query_text_id, p.plan_id, rs.last_execution_time
FROM sys.query_store_query_text AS qt 
JOIN sys.query_store_query AS q 
    ON qt.query_text_id = q.query_text_id 
JOIN sys.query_store_plan AS p 
    ON q.query_id = p.query_id 
JOIN sys.query_store_runtime_stats AS rs 
    ON p.plan_id = rs.plan_id
ORDER BY rs.last_execution_time DESC;

Queries with more than one execution plan

SELECT 
q.query_id
,qt.query_sql_text
,p.query_plan AS plan_xml
,p.last_execution_time
FROM (SELECT COUNT(*) AS count, q.query_id 
FROM sys.query_store_query_text AS qt
JOIN sys.query_store_query AS q
    ON qt.query_text_id = q.query_text_id
JOIN sys.query_store_plan AS p
    ON p.query_id = q.query_id
GROUP BY q.query_id
HAVING COUNT(distinct plan_id) > 1) AS qm
JOIN sys.query_store_query AS q
    ON qm.query_id = q.query_id
JOIN sys.query_store_plan AS p
    ON q.query_id = p.query_id
JOIN sys.query_store_query_text qt 
    ON qt.query_text_id = q.query_text_id
ORDER BY q.query_id, p.plan_id;

What’s in it for me

Well I hope that the answer to this is pretty obvious to you after you have read this post 🙂

The Query Store enables any person responsible for database performance to monitor, analyze and keep track of queries, execution plans and resource usage through system views or standard reports from the SQL Server Management Studio.

Conclusion

The new Query Store feature is a great add-on for the DBA (or accidental DBA) who needs analytical data in a standard form and the availability of query statistics for troubleshooting.

This blogpost is based on the latest CTP of SQL Server 2016 (CTP 3.0) which can be downloaded here:

https://www.microsoft.com/en-us/evalcenter/evaluate-sql-server-2016

Many-to-many in SSAS Tabular


With the release of SQL Server 2016 CTP 3.0 also comes the ability to test the new functionality of Many-to-Many in SSAS Tabular.

This blogpost will cover the aspects of the many-to-many feature from SQL Server 2016 – including:

  • Prerequisites
  • The old way
  • The new way

This post is based on data from the AdventureWorksDW2012 database.

Prerequisites

In order to test the new many-to-many feature from SQL Server 2016 SSAS Tabular you’ll need to download the latest CTP from Microsoft – it can be found here:

http://blogs.technet.com/b/dataplatforminsider/archive/2015/10/28/sql-server-2016-community-technology-preview-3-0-is-available.aspx

Also you’ll need the Visual Studio 2015 and the add-in for Business Intelligence:

https://msdn.microsoft.com/en-us/library/mt204009.aspx

Choose the SSDT October 2015 Preview in Visual Studio for download.

After a bit of waiting for the installation, you are ready to test the functionality.

The old way

Before showing the new (and, to me, right) way to do many-to-many in SSAS Tabular, let me first show you how it was done prior to SQL Server 2016 CTP 3.0.

Thanks to the two brilliant guys from SQLBI, Marco Russo (T, L) and Alberto Ferrari (T, L), we've had the approach below for quite a while now.

First of all, you need to build a bridge table with the column that links the two tables and build a model like the one illustrated below.

The m2mKey is a concatenation of the SalesOrderNumber and SalesOrderLineNumber, as Tabular still does not have the ability to handle a relationship on two columns at the same time.

m2m_oldway

Then all measures that need to take the DimSalesReason into account have to be rewritten with some DAX coding:

Sum of UnitPrice:=CALCULATE(SUM([UnitPrice]);vBridgeSalesReason)

Then the output will look something like this:

m2m_oldway_result

The new way

With the CTP 3.0 release and the SSDT add-on for Visual Studio 2015, this now gets as easy as 1, 2, 3.

First of all, it is now possible to build a datamodel directly without any bridge tables like this:

m2m_newway

Note the highlighted area – here you can see the many-to-many relationship. This is modelled when creating the relationship in the model like this:

m2m_filterdirection

Remember to set the Filter Direction to << To Both Tables >>.

And that is it!

The result without doing DAX formulas:

m2m_oldway_result

HAPPY CODING 🙂

Behold the new live query stats in SQL Server 2016

With the release of SQL Server 2016 also comes a great new feature to get a live query stats view of the current execution plan for an active query.

This blogpost will cover the aspects of this new feature including:

  • Introduction
  • How to activate
  • How to use and read the output
  • Downsides – if any

Introduction

Live query plans are a new feature from Microsoft in the current release of SQL Server 2016 CTP 2.2, and will hopefully also be in the final release.

The feature provides real-time insight into the SQL Server engine's query execution process – in a visual manner, as data flows from one plan operator to the next in the execution. The display covers the usual elements of an execution plan, including the number of rows handled, the time spent, the progress of the individual operators and other well-known statistics of a query execution.

One of the good things about this feature is the ability to show and analyze the query even before it has finished. This is useful when debugging complex queries – the operators are shown with their individual performance, giving the DBA or other person responsible for the database a faster view of where to focus the performance optimization.

Activation

In SQL Server Management Studio there is a new option when right-clicking the query window – “Include Live Query Statistics”:

live query stats - activation

For some reason, there is no keyboard shortcut to activate that functionality. Maybe this will come in the RTM release of SSMS for SQL Server 2016.

This function can also be activated from the top-menu in SSMS 2016 CTP 2.2:

live query stats - activation 2

Note: this feature also works on SQL Server 2014 SP1 – as the feature relies on underlying DMVs from this service pack.

If the session running the query has enabled either statistics XML (SET STATISTICS XML ON;) or statistics profile (SET STATISTICS PROFILE ON;) then the Live Query Statistics can also be reached from the activity monitor.

The DBA can also activate a server-wide setting to enable Live Query Statistics on all sessions with the extended event query_post_execution_showplan – for more info click HERE [Link: https://msdn.microsoft.com/en-us/library/bb630319.aspx]:

Right click the current query in Active Expensive Queries and choose ‘Show Live Execution Plan’:

live query stats - show plan

The observant reader will now ask – why are the names not the same across SSMS? Well – I actually don't know.

Security

The database-level SHOWPLAN permission is required to populate the Live Query Statistics, and the server-level VIEW SERVER STATE permission is required to see the live statistics.

And voila!

live query stats - gifs animation

The query runs against an enlarged table from AdventureWorksDW2012 with more than six million rows.

…and with a bit more complex query (not optimal query design at all):

live query stats - gif animation 2

The output

The output from the Live Query Statistics can be read like any other execution plan. The operators are the same, and the deeper statistics can be revealed as usual by hovering the cursor over the individual operators:

The data in these statistics will not change as the query runs – you have to move the cursor and hover again to get updated info on the specific operator.

The underlying DMV is sys.dm_exec_query_profiles [LINK: https://msdn.microsoft.com/en-us/library/dn223301.aspx], which can be queried directly and gives the same results in text form as the graphic animation. But it is a lot easier to decode the animation than the text-based results.
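
A rough example of querying the DMV for a running session (replace 51 with the actual session_id):

SELECT
    node_id,
    physical_operator_name,
    row_count,
    estimate_row_count,
    elapsed_time_ms,
    cpu_time_ms
FROM sys.dm_exec_query_profiles
WHERE session_id = 51
ORDER BY node_id;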

The catch

What a great new feature – and it also works on SQL Server 2014 SP1 as mentioned earlier. But there is a catch – as always:

  1. If the query uses columnstore indexes, the live window will not show
  2. If the query uses memory-optimized tables, the live window will not show
  3. Natively compiled stored procedures are not supported

It only works on SQL Server 2014 SP1 and onwards. But who isn't using one of those in production by now 🙂

Conclusion – live query stats

The new Live Query Statistics feature is great for performance tuning of queries and for the DBA who wants to see the live performance of data loading in the database. The feature works like a charm and is, from my perspective, a nice addition.

I hope this post makes a great start for you to work with the Live Query Statistics. This post is written based on the current CTP of SQL Server 2016 (CTP 2.2) which can be downloaded here: https://www.microsoft.com/en-us/evalcenter/evaluate-sql-server-2016

If the feature is updated in later versions of SQL Server 2016, then this post will be updated accordingly.

Row level security in SQL Server 2016

With the release of SQL Server 2016 comes many great new features. One of these is the implementation of row level security in the database engine.

This blogpost will cover the aspects of this new feature – including:

  • Setup
  • Best practice
  • Performance
  • Possible security leaks

Introduction

The row level security feature was released earlier this year to Azure – following Microsoft’s cloud-first release concept.

A big issue with the SQL Server engine in the past was that it only understands tables and columns, so you had to simulate row security using secured views, stored procedures or table-valued functions. The problem then was to make sure that there was no way to bypass them.

With SQL Server 2016, this is no longer an issue.

Now the SQL Server engine handles the security policy in a centrally controlled area.

row level security 1

Setup and best practice

Row-level security is based on a special inline table-valued function. This function returns either a single row with a 1, or no rows, based on the user's rights to that specific row.

Let us take an example:

First of all, I’ll create a database and some users to test with:

CREATE DATABASE RowFilter;
GO
 
USE RowFilter;
GO
 
CREATE USER userBrian WITHOUT LOGIN;
CREATE USER userJames WITHOUT LOGIN;
GO

A table with example data, and grant select to the new users:

CREATE TABLE dbo.SalesFigures (
[userCode] NVARCHAR(10),
[sales] MONEY)
GO
 
INSERT  INTO dbo.SalesFigures
VALUES ('userBrian',100), ('userJames',250), ('userBrian',350)
GO
 
GRANT SELECT ON dbo.SalesFigures TO userBrian
GRANT SELECT ON dbo.SalesFigures TO userJames
GO


Now we’ll add a filter predicate function as below:

CREATE FUNCTION dbo.rowLevelPredicate (@userCode as sysname)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN SELECT 1 AS rowLevelPredicateResult
WHERE @userCode = USER_NAME();
GO


This illustrates that the current user must have associated records in order to get any results. Notice that the function does not have access to the rows themselves.

Furthermore the function can contain joins and lookup tables in the where clause – but beware of the performance hit here. Look further down this post for more info.

The last thing to do is to add a filter predicate to the table dbo.SalesFigures:

CREATE SECURITY POLICY UserFilter
ADD FILTER PREDICATE dbo.rowLevelPredicate(userCode)
ON dbo.SalesFigures
WITH (STATE = ON);
GO


That’s it.

Let’s test the results with the users added before:

 EXECUTE AS USER = 'userBrian';
SELECT * FROM dbo.SalesFigures;
REVERT;
GO


This gives me 2 rows:

EXECUTE AS USER = 'userJames';
SELECT * FROM dbo.SalesFigures;
REVERT;
GO
 


This gives me 1 row:

The execution plan shows a new filter predicate when this row level security is added:

To clean up the examples.

USE master;
DROP DATABASE RowFilter;


Performance

Some might ask, “what about the performance – isn’t there a performance hit in this use of functions?”

The short answer is “It depends”.

If you only use a direct filter on the table, there is very little to no impact on the performance. The filter is applied directly to the table as with any other filter. Compared to the old way of doing row filters with stored procedures or table-valued functions, this new approach performs better.

If you plan to use lookup tables or joins in the predicate function, then you must beware of the helper tables’ indexes and how fast they can deliver data to the function. If the tables are large and slow performing (without indexes etc.) then you will experience bad performance in the row filter function. But that’s just like any other lookup or join that you might do in your solutions.

Best practices

There are some best practices given from Microsoft:

  • It is highly recommended to create a separate schema for the RLS objects (predicate function and security policy).
  • The ALTER ANY SECURITY POLICY permission is intended for highly-privileged users (such as a security policy manager). The security policy manager does not require SELECT permission on the tables they protect.
  • Avoid type conversions in predicate functions to avoid potential runtime errors.
  • Avoid recursion in predicate functions wherever possible to avoid performance degradation. The query optimizer will try to detect direct recursions, but is not guaranteed to find indirect recursions (i.e., where a second function calls the predicate function).
  • Avoid using excessive table joins in predicate functions to maximize performance.

Possible security leaks

This new row filter context can leak information to carefully coded queries.

The above example can be breached with the following query:

SELECT 1/([sales]-250) FROM dbo.SalesFigures
WHERE Usercode = 'userJames'

This will give an error: Divide by zero error encountered.


This tells the user trying to access the table that userJames has a sale of 250. So even though the row filter prevents users from accessing data they are not allowed to see, attackers can still try to determine the data in the table using the above method.

Conclusion

Row level security has been a much-wanted feature for quite a while, and with the functionality now in place, and planned for the RTM version of SQL Server 2016, DBAs and other people working with security can use it out of the box.

I hope this post makes a great start for you if you would like to try out the row level security function. Currently the feature is available in the latest CTP version (2.2) – which can be downloaded here: https://www.microsoft.com/en-us/evalcenter/evaluate-sql-server-2016

The DBAs guide to stretch database

One of the new features in SQL Server 2016 – and there are a lot – is the ability to stretch on-premise databases to an Azure environment.

This blogpost will cover some of the aspects of this – including:

  • Primary setup – how to get started
  • Monitoring state of databases that are in ‘stretch mode’
  • Daily work with stretch databases
  • Backup – what’s good to know

With the release of SQL Server 2016 comes the new feature called stretch database. The feature lets you, as a database administrator, stretch databases (read: copy old data) to an Azure environment. The data still responds to the normal queries that are used; in other words, there is no need to change the current setup for existing applications and other data contracts to use this feature.

So when is the stretch database something you should consider?

  • When you only sometimes need to query the historical data
  • The stored transactional data needs to keep all historical data
  • The size of the database tables is growing out of control (if it is an issue of bad design, you need to take other actions…)
  • The backup times are too long to fit the daily maintenance windows

If you have one or more marks on the above list, then you have a database that is a candidate for stretching into Azure.

A typical database in stretch mode is a transactional database with very large amounts of data (more than a billion rows) stored in a small number of tables.

The feature is applied to individual tables in the database – but enabling the feature at the database level is a prerequisite.

The limitations

No free goodies without limitations.

There is a list of limitations for a stretch database – two types, in fact: datatypes and features.

The datatypes that are not supported for a stretch database are:

  • filestream
  • timestamp
  • sql_variant
  • XML
  • geometry
  • geography
  • hierarchyid
  • CLR user-defined types (UDTs)

The features that are not supported:

  • Column Set
  • Computed Columns
  • Check constraints
  • Foreign key constraints that reference the table
  • Default constraints
  • XML indexes
  • Full text indexes
  • Spatial indexes
  • Clustered columnstore indexes
  • Indexed views that reference the table

Therefore, it is advisable to have an agreement with your developers if you plan to use the stretch feature. It is likely that they can code without the items on the above lists, but if they have already implemented such features and need to work around them, then you are not in good standing for a while.

Security

In order to handle and maintain the stretch feature, the current user must be a member of the db_owner role, and the CONTROL DATABASE permission is needed to enable stretch at the database level.

Setup – how to get started

First, get an Azure account, if you do not already have one. Then…

A small change in sp_configure is needed to get the feature ready.

EXEC sp_configure 'remote data archive' , '1';
RECONFIGURE;

Enabling the database

It is a prerequisite to enable the database for stretch in order to enable its tables.

It is pretty straightforward – just right-click the database, choose Tasks and select ‘Enable database for stretch’:

Then the SQL Server asks you to sign in to your Azure environment.

You need to choose a set of settings for your stretch database in Azure – including:

  • Location for the server
  • Credential for the server
  • Firewall rules

There is a summary page with all info – when complete, just hit ‘Finish’.

Note: the current applications and data contracts are NOT able to access the data in Azure directly. The only way to access this data is through the normal on-premise database. This database then decides whether to access the Azure database or not, based on the current configuration and state of migration (see below for help with the latter).

Enabling tables for stretch

As easy as it is for the database, so it is for the tables.

Right-click the table that you want to stretch – choose ‘Stretch’ and ‘Enable Stretch’.

As seen in the screenshot, you can also do the following tasks here: Disable, Pause and Resume stretch. All 3 are hopefully self-explanatory.

Monitoring the state of databases and tables in stretch mode

A set of dynamic management views (DMVs) has been released, and existing catalog views have been updated, to help with the work of monitoring the state of stretch databases.

The DMV sys.dm_db_rda_migration_status shows you the current state, in batches and rows, of the data in the stretched tables. For more information, refer to MSDN: sys.dm_db_rda_migration_status.

The catalog views sys.databases and sys.tables now also contain information about the stretch feature for each part respectively. See more at MSDN: sys.databases and sys.tables.

To view the remote databases and tables for stretched data, use the two new catalog views sys.remote_data_archive_databases and sys.remote_data_archive_tables.
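
A quick example of using them:

-- list the Azure databases and tables that hold the stretched data
SELECT * FROM sys.remote_data_archive_databases;
SELECT * FROM sys.remote_data_archive_tables;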

A big note for the current CTP 2.2 release:
This release only supports stretching entire tables. This means that an architectural decision needs to be made to move historical data to separate tables. I assume that the final release will contain a query-based configuration in order to find and detect the historical data to be moved to the Azure environment.

Backup and restore

Backup and restore work the same as before the stretch feature. The same strategy applies, as do the same precautions for data storage in Azure.

One must keep in mind that the on-premise backup only contains on-premise data.

The restore process adds a step to the checklist when restoring a database with stretch enabled.

At the end of the restore, the connection to the stretched database in Azure must be re-established with the stored procedure sys.sp_reauthorize_remote_data_archive.

When this SP is executed, the vertical arrow on this illustration is reestablished:

Conclusion

The stretch database feature is a very nice feature to get with the release of SQL Server 2016. It enables the DBA to handle historical data and storage capacity without having to consult the developers and/or architects of new solutions. Also, current applications and solutions can be configured to use this new feature.

This post makes a great place to begin with the stretch feature of SQL Server 2016. Personally, I hope that the final feature has a bit more configuration to handle the historical data.
