SQL serveren kan tegne

Vidste du at en SQL server kan tegne ud fra polygoner?

En SQL server har en særlg datatype geometry som reelt blot er polygoner i et koordinatsystem. Overordnet set er geometry en spacial datatype (du kan læse meget mere som spacial datatypes i links nederst).

Polygoner kræver to punkter for at tegne en skreg. De to punkter indgår, på SQL serveren, i et autogenereret koordinatsystem og serveren sørger selv for, ved denne datatype, at tegne en streg imellem de to punkter.

Lad os tegne

Et eksempel på en samling polygoner der tilsammen danner en tegning på SQL serveren, kan ses nedenfor.

SELECT geometry::STPolyFromText('POLYGON ((104 -222, 173 -222, 174 -174, 171 -160, 163 -147, 150 -137, 136 -128, 123 -123, 110 -117, 82 -116, 61 -122, 41 -134, 17 -150, 6 -173, 1 -194, 0 -232, 9 -259, 21 -276, 32 -289, 52 -302, 69 -312, 88 -320, 105 -335, 110 -375, 102 -390, 84 -395, 75 -385, 76 -330, 5 -333, 7 -390, 11 -411, 25 -428, 42 -442, 67 -451, 105 -453, 126 -446, 144 -439, 162 -424, 173 -404, 180 -382, 182 -337, 178 -311, 167 -296, 153 -279, 138 -268, 89 -234, 75 -222, 71 -208, 73 -188, 88 -178, 100 -190, 105 -220, 104 -222))',0) AS Drawing
UNION ALL
SELECT geometry::STMPolyFromText('MULTIPOLYGON (((324 -127, 404 -127, 405 -488, 322 -490, 322 -421, 311 -432, 291 -446, 277 -452, 259 -453, 248 -446, 239 -440, 228 -429, 221 -419, 215 -402, 215 -386, 213 -188, 216 -174, 219 -159, 226 -148, 235 -140, 245 -132, 261 -127, 278 -127, 294 -134, 306 -143, 322 -158, 324 -127)),((296 -191, 300 -186, 308 -182, 319 -188, 324 -196, 322 -384, 317 -391, 311 -395, 305 -395, 300 -395, 293 -388, 296 -191)))',0) AS Drawing
UNION ALL
SELECT geometry::STPolyFromText('POLYGON ((447 -62, 532 -65, 532 -450, 447 -450, 447 -62))',0) AS Drawing

Ved at eksekvere denne, fås en særlig fane i resultatvinduet, kaldet Spacial Results – her kan du nu se en illustration af ovenstående query:

SQL Serveren kan tegne

Kan jeg bruge det til noget?

Både og – for en del år siden, da SQL Server Reporting services var sit højeste og havde sin storhedstid, var det en rigtig god feature at have, når man gerne ville tegne f.eks. kort via dynamiske opslag i en database.

Det kunne være et klikbart kort over Danmark, hvor data til at tegne kortet, ligger direkte på SQL serveren som rækker i en tabel.

SQL serveren er god til at regne på polygoner og spacial data. Eks kan man finde skæringspunkter for to linjer – se eks fra Microsoft nedenfor.

Power BI kan ikke direkte supportere spacial datatypes (endnu – pr. feb 2021). Men mon ikke det kommer på et tidspunkt.

Opsummering

Nu ved du at SQL serveren kan tegne – har du fået blod på tanden til at lære mere om polygoner på SQL serveren og geometry datatypen, så kan du se mere i nedenstående links:

De udokumenterede funktioner du altid har haft brug for

Der findes mange dokumenterede funktioner i SQL serveren – men der findes også nogen udokumenterede funktioner, som andre har gjort et stort arbejde for at finde og dele med resten af verden. Funktioner som ikke er dokumenterede og som dermed ikke er supporteret.

Måske du har prøvet at sidde med noget SQL kode og manglet en given funktion eller metode til at udføre en specifik opgave.

Nogen af dem jeg for noget tid siden faldt over er funktionerne least og greatest.

Rækkebaseret i stedet for kolonnebaseret

udokumenterede funktioner rækker mod kolonner

De to funktioner least og greatest er rækkebaseret modsat dem vi kender omkring størst og mindst, nemlig min og max.

Normalt når man anvender min eller max funktionerne, så får man valideret datasættet for en given kolonne og for den mindste eller største værdi (kan også være tekst) for den givne kolonne.

Eks:

NavnVærdiLængde
Brian1012
Bent3220
Arne45100
Sofus21

Med ovenstående tabel til dette statement:

select Navn, max(Værdi) from tabel group by Navn

Fås følgende resultat:

results:
Arne | 45

Med de nævnte funtioner least og greatst, kan man nu få værdier (størst eller mindst) pr. række.

Eks med ovenstående tabel:

select Navn, least(Værdi, Længde) as least, greatest(Værdi, Længde) as greatest from tabel

Fås følgende resultat:

results:
Navn    |  least   | greatest
Brian   |   10     |   12
Bent    |   20     |   32
Arne    |   45     |  100
Sofus   |    1     |    2

Bemærk at SQL Server mangement studio ikke kan finde intellisense for disse udokumenterede funktioner, så du vil opleve røde markeringer ved anvendelse af dem. Men de kan trygt anvendes alligevel – også i Azure.

Opsummering på udokumenterede funktioner

Der er altså en del flere funktioner at bruge end dem, der er dokumenteret og supporteret af Microsoft.

Der er lidt flere at finde på de nedenstående links:

Hvis du kender til flere funktioner som ikke er dokumenteret af Microsoft, men som du bruger, så skriv dem gerne som kommentar nedenfor.

Uge 2: Data Pages – grundlaget i SQL Serveren

Sidste uge var det hårde indløb og alle de nye termer omkring SQL serveren. Jeg skrev om hvordan en SQL server eksekverer de queries, vi sender afsted til den. Gik du glip af det afsnit, kan du læse det her:

Uge 1 – Hvordan en SQL Server eksekverer en forespørgsel

Jeg berørte også meget kort omkring Data Pages, som er buffere på 8kb (kilobyte). I dette indlæg tager jeg fat i dette begreb yderligere og får beskrevet hvad de betyder for performance på SQL serveren.

Pages er fundamentet for en SQL server, alt omkring en SQL server handler i bund og grund om Pages. Når man gerne vil forbedre performance af de queries, der arbejdes med, forsøger man altid at mindske det antal Page reads SQL serveren skal foretage. Senere i denne blogserie kommer jeg ind på index og at de også er baseret på Pages.

Så at forstå hvad en Page er, giver grobund for god performance tuning på SQL serveren.

Data Pages og deres struktur

En Page på SQL serveren fylder altid 8kb, og der er forskellige typer Pages, som f.eks. Data Pages, Index Pages, System Pages osv. I dette indlæg vil jeg kigge i detaljer på Data Pages, som SQL serveren anvender til at gemme data fra tabellerne. En Data Page består af 3 dele:

  • Page Header
  • Payload
  • Row Offset Array

Page Header er altid 96 bytes i længde på en SQL server (også uanset hvilken type Page det er). Header indeholder oplysninger om f.eks. Page ID og Object ID. Payload området i en Data Page er der hvor data fra tabellerne bliver fysisk gemt. SQL serveren har allokeret 8060 bytes til dette formål. Den skarpe læser, vil nu kunne se, at man kan regne på hvor mange rækkker der kan være i en enkelt Data Page, ved at se på hvor meget en række fylder (alle kolonnernes datatyper i længde (bytes)) og dele den værdi op i de 8060 bytes. Evt. decimaler i den udregning “kappes af”.

Målet med opbygningen af data på en SQL server, er altid at have så mange rækker som muligt i en enkelt Data Page, for som du måske husker fra sidste indlæg, så skal SQL serveren have fat i alle de Data Pages, der skal til for at arbejde med det fornødne data i forbindelse med en given forespørgsel. SQL Serveren kan ikke kun læse en delmængde af en Data Page. I/O operationer sker altid (som minimum) på Page niveau.

Den sidste del af en Data Page er Row Offset Array. Denne lille del af en Data Page indeholder 2 bytes for hver række i pagen med henvisning til hvor i pagen den givne række er placeret. Lidt som et mini-index i en bog.

Den første række på en page begynder altid umiddelbart efter Page Headeren – altså efter byte placering 96. De efterfølgende rækkers placering er givet ud fra rækkernes længde, som beskrevet lidt tidligere.

På nedenstående billede, kan du se opbygningen af en Data Page, som jeg har forsøgt at beskrive ovenfor.

Data Pages

Data Page Internals

En nogen lunde simpel tabel på SQL server kunne se ud som nedenfor.

create table dbo.Medarbejder (
Fornavn char(50)
,Efternavn char(50)
,Adresse char(100)
,Postnummer char(5)
,Alder int
,OprettetDato datetime
)

Ud fra ovennævnte tabel struktur kan vi nu regne på hvor mange rækker der kan være på en Data Page.

En rækker fylder 217 bytes (50+50+100+5+4+8) – int bruger 4 bytes og datetime gemmes reelt som bigint på en DataPage og bruger 8 bytes.

Ved at dele de allokerede 8060 bytes med 217, får vi 37 hele rækker og 31 bytes i overskud. Disse 31 bytes kan ikke bruges til andet, da SQL serveren ikke kan fylde andet i en specifik Data Page, end det er hører til en specifik tabel. Så hvis en række fylde 4031 bytes ville man have 4029 bytes som spildt plads i en Data Page – og denne fylder stadig 8kb på disk.

Tingene ændrer sig, hvis man anvender datatyper med variabel længde eks VARCHAR (alle dem med “var” foran), for her kan SQL serveren gemme kolonnerne på forskellige Data Pages og dermed optimere lidt på Page forbruget.

SQL serveren kan faktisk hjælpe med et samlet overblik over hvor meget plads der er spildt i alle Data Pages, og dermed spildt i det tabel design. Via det såkalde Dynamic Management View (DMV) sys.dm_os_buffer_descriptors, kan man se alle rækker fra resultatsættet som repræsenterer en Data Page (Pas dog på med at eksekvere denne på systemer med meget ram, da resultatsættet vil være tilsvarende stort). I kolonnen free_space_in_bytes vises hvor meget ledig plads der i hver enkelt Data Page.

En god standard forespørgsel at have ved hånden kunne være nedenstående:

select db_name(database_id), sum(free_space_in_bytes)/1024 as 'FreeSpaceInKb'
from sys.dm_os_buffer_descriptors
where database_id <> 32767
group by database_id
order by 2 desc

Opsummering – Data Pages

Jeg håber det har givet lidt mere klarhed over hvad en Data Page på en SQL server er, hvordan man bør tænke over sine tabellers struktur ifm. oprettelsen af dem og hvordan SQL serveren bruger dem til at gemme data på disk.

Skulle du have fået mere blod på tanden for at læse om Data Pages – kan jeg anbefale følgende sider:

Uge 1: Hvordan en SQL Server eksekverer en forespørgsel

Dette er første indlæg i en længere serie om SQL Performnace Tuning. Før vi hopper direkte ud i alle de lækre detaljer omkring performance tuning på en SQL server, er det vigtigt at have grundlaget på plads.

I dette indlæg kigger jeg derfor nærmere på hvordan en SQL server eksekverer en forespørgsel. Det er en vigtig del af det at forstå SQL serverens metoder, da det vil være herpå de næste indlæg vil bygge videre.

Som vist ovenfor er SQL serveren internt delt i en Relational Engine og en Storage Engine. Den største del af Relational Engine er Query Optimizeren (ofte blot kald optimizeren). Det eneste opgave for optimizeren er at generere en fysisk execution plan (mere herom i et senere indlæg) for de forespørgsler vi sender til SQL serveren.

Læsning af data

Alle forespørgsler – eller queries – som vi sender til SQL Serveren passerer igennem Protocol Layer til Command Parser. Command Parser kontrollerer den kode vi sender afsted – f.eks om det er valid TSQL kode og eksiterer alle anvendte tabeller og kolonner i databasen. Resultatet af denne opgave er et såkaldt Query Tree, en træstruktur som repræsenterer den afsendte query. Denne træstruktur bliver så anvendt af optimizeren til at generere en execution plan.

Den færdige execution plan bliver herefter sendt videre til Query Executor. Her er opgaven at udføre de handlinger som execution planen foreskriver. Inden dette sker, gemmes den modtage execution plan i Plan Chachen – her kan SQL serveren nemlig genbruge execution planer fra tidligere. Denne metode med at gemme og genbruge execution plans er på samme tid et meget stærkt og meget farligt koncept af SQL serveren. Mere herom i et senere indlæg omkring Plan Cache.

Efter execution plan er gemt i cachen, begynder query executor at kommunikere med storage engine og eksekverer hver en lille del af execution plan – kaldet operators.

Når data tilgås fra execution plan (dette er det eneste sted vi kan få fat i data) er det Access Methods som kommunikerer med Buffer Manager for at læse specifikke pages – mere om pages i de næste indlæg, og lige nu er det nok at vide at en page er en buffer med data der fylder 8kb hvori data og index er gemt. Det er Buffer Manager der styrer Buffer Pool hvor pages er gemt. Det er Buffer Pool der er den fysiske anvendelse af memory (ram) som man kan se SQL serveren anvender i operativsystemet.

Når en page allerede er gemt i buffer pool, så bliver denne page øjeblikkeligt returneret. Når dette sker er det en Logical Read i SQL serveren. Hvis en page ikke allerede er gemt i buffer pool, så udfører buffer manager en asynkron I/O forespørgsel mod den fysiske disk og ind i buffer pool. Dette er en Physical Read. Under denne asynkrone handling venter querien indtil handlingen er færdig. Dette kaldes Waits og senere i et indlæg kommer jeg ind på dette emne og Wait Statistics.

Så snart en page er læst ind til buffer pool, bliver denne page sendt videre til den proces som forespurgte på den. Når execution plan er færdig, vil resultatet af det behandlede data blive returneret til brugeren eller applikationen gennem Protocol Layer.

Ændring af data

Når man arbejder med TSQL udtryk som ændrer på det eksisterende data (INSERT, DELTE, UPDATE og MERGE), så arbejder storage engine også med Transaction Manageren. Opgaven for denne proces er at skrive records til transaktionsloggen som beskriver de handlinger der sker i transaktionen. Så snart disse records er skrevet til loggen, kan transaktionen blive kørt færdig. Dette betyder også at SQL serveren kun kan være så hurtig som den tilhørende transaktionlog.

Pages som er ændret i hukommelsen på serveren bliver skrevet til disk gennem den såkaldte CHECKPOINT proces. Som udgangspunkt eksekveres checkpoint processen hvert minut og skriver dirty pages fra buffer manageren til disk. En dirty page er en page med ændringer som endnu ikke er skrevet til disk.

Når en page er blevet skrevet til disk markeres den som “clean” igen.

Opsummering – Performance Tuning

Som du sikkert allerede har erfaret af denne indledning til SQL Performance Tuning, så er der rigtig mange ting der sker på SQL serveren, når man eksekverer en forespørgsel. Mange af de ord og termer der er brugt i dette indlæg, vil blive gjort mere klar senere i andre indlæg.

Har du fået mere blod på tanden til at læse videre om hvordan SQL serveren håndterer forespørgsler og databehandling, så vil jeg anbefale dig at læse nogen af nedenstående blogs:

Næste gang gennemgår jeg lidt mere om Data Pages og deres berettigelse i SQL Server.

Har du kommentarer eller andet, er du velkommen til at skrive dem herunder.

Importing flatfiles to a SQL server with a varying number of columns

Ever been as frustrated as I when importing flatfiles to a SQL Server and the format suddenly changes in production?

The mostly used integration tools (like SSIS) are very dependent on the correct, consistent and same metadata when working with flatfiles.

I’ve come up with a solution that I would like to share with you.

When implemented, the process of importing flatfiles with changing metadata is handled in a structured, and most important, flawless way. Even if the columns change order or existing columns are missing.

Background

When importing flatfiles to SQL server almost every standard integration tool (including TSQL bulkload) requires fixed metadata from the files in order to work with them.

This is quite understandable, as the process of data transportation from the source to the destination needs to know where to map every column from the source to the defined destination.

Let me make an example:

A source flatfile table like below needs to be imported to a SQL server database.

This file could be imported to a SQL Server database (in this example named FlatFileImport) with below script:

create table dbo.personlist (
	[name] varchar(20),
	[gender] varchar(10),
	[age] int,
	[city] varchar(20),
	[country] varchar(20)
);

BULK INSERT dbo.personlist
FROM 'c:\source\personlist.csv'
WITH
(
	FIRSTROW = 2,
	FIELDTERMINATOR = ';',  --CSV field delimiter
	ROWTERMINATOR = '\n',   --Use to shift the control to next row
	TABLOCK,
	CODEPAGE = 'ACP'
);

select * from dbo.personlist;

The result:

If the column ‘Country’ would be removed from the file after the import has been setup, the process of importing the file would either break or be wrong (depending on the tool used to import the file) The metadata of the file has changed.

-- import data from file with missing column (Country)
truncate table dbo.personlist;
 
BULK INSERT dbo.personlist
FROM 'c:\source\personlistmissingcolumn.csv'
WITH
(
	FIRSTROW = 2,
	FIELDTERMINATOR = ';',  --CSV field delimiter
	ROWTERMINATOR = '\n',   --Use to shift the control to next row
	TABLOCK,
	CODEPAGE = 'ACP'
);
 
select * from dbo.personlist;

With this example, the import seems to go well, but upon browsing the data, you’ll see that only one row is imported and the data is wrong.

The same would happen if the columns ‘Gender’ and ‘Age’ where to switch places. Maybe the import would not break, but the mapping of the columns to the destination would be wrong, as the ‘Age’ column would go to the ‘Gender’ column in the destination and vice versa. This due to the order and datatype of the columns. If the columns had the same datatype and data could fit in the columns, the import would go fine – but the data would still be wrong.

-- import data from file with switched columns (Age and Gender)
truncate table dbo.personlist;
 
BULK INSERT dbo.personlist
FROM 'c:\source\personlistswitchedcolumns.csv'
WITH
(
	FIRSTROW = 2,
	FIELDTERMINATOR = ';',  --CSV field delimiter
	ROWTERMINATOR = '\n',   --Use to shift the control to next row
	TABLOCK,
	CODEPAGE = 'ACP'
);
Importing flatfiles to a sql server

When importing the same file, but this time with an extra column (Married) – the result would also be wrong:

-- import data from file with new extra column (Married)
truncate table dbo.personlist;
 
BULK INSERT dbo.personlist
FROM 'c:\source\personlistextracolumn.csv'
WITH
(
	FIRSTROW = 2,
	FIELDTERMINATOR = ';',  --CSV field delimiter
	ROWTERMINATOR = '\n',   --Use to shift the control to next row
	TABLOCK,
	CODEPAGE = 'ACP'
);
 
select * from dbo.personlist; 

The result:

Above examples are made with pure TSQL code. If it was to be made with an integration tool like SQL Server Integration Services, the errors would be different and the SSIS package would throw more errors and not be able to execute the data transfer.

The cure

When using the above BULK INSERT functionality from TSQL the import process often goes well, but the data is wrong with the source file is changed.

There is another way to import flatfiles. This is using the OPENROWSET functionality from TSQL.

In section E of the example scripts from MSDN, it is described how to use a format file. A format file is a simple XML file that contains information of the source files structure – including columns, datatypes, row terminator and collation.

Generation of the initial format file for a curtain source is rather easy when setting up the import.

But what if the generation of the format file could be done automatically and the import process would be more streamlined and manageable – even if the structure of the source file changes?

From my GitHub project you can download a home brewed .NET console application that solves just that.

If you are unsure of the .EXE files content and origin, you can download the code and build your own version of the GenerateFormatFile.exe application.
Another note is that I’m not hard core .Net developer, so someone might have another way of doing this. You are very welcome to contribute to the GitHub project in that case.

The application demands inputs as below:

Example usage:

generateformatfile.exe -p c:\source\ -f personlist.csv -o personlistformatfile.xml -d ;

Above script generates a format file in the directory c:\source\ and names it personlistFormatFile.xml.

The content of the format file is as follows:

The console application can also be called from TSQL like this:

-- generate format file
declare @cmdshell varchar(8000);
set @cmdshell = 'c:\source\generateformatfile.exe -p c:\source\ -f personlist.csv -o personlistformatfile.xml -d ;'
exec xp_cmdshell @cmdshell;

If by any chance the xp_cmdshell feature is not enabled on your local machine – then please refer to this post from Microsoft: Enable xp_cmdshell

Using the format file

After generation of the format file, it can be used in TSQL script with OPENROWSET.

Example script for importing the ‘personlist.csv’

-- import file using format file
select *  
into dbo.personlist_bulk
from  openrowset(
	bulk 'c:\source\personlist.csv',  
	formatfile='c:\source\personlistformatfile.xml',
	firstrow=2
	) as t;
 
select * from dbo.personlist_bulk;

This loads the data from the source file to a new table called ‘personlist_bulk’.

From here the load from ‘personlist_bulk’ to ‘personlist’ is straight forward:

-- load data from personlist_bulk to personlist
truncate table dbo.personlist;
 
insert into dbo.personlist (name, gender, age, city, country)
select * from dbo.personlist_bulk;
 
select * from dbo.personlist;
 
drop table dbo.personlist_bulk;

Load data even if source changes

Above approach works if the source is the same every time it loads. But with a dynamic approach to the load from the bulk table to the destination table it can be assured that it works even if the source table is changed in both width (number of columns) and column order.

For some the script might seem cryptic – but it is only a matter of generating a list of column names from the source table that corresponds with the column names in the destination table.

-- import file with different structure
-- generate format file
if exists(select OBJECT_ID('personlist_bulk')) drop table dbo.personlist_bulk
 
declare @cmdshell varchar(8000);
set @cmdshell = 'c:\source\generateformatfile.exe -p c:\source\ -f personlistmissingcolumn.csv -o personlistmissingcolumnformatfile.xml -d ;'
exec xp_cmdshell @cmdshell;
 
 
-- import file using format file
select *  
into dbo.personlist_bulk
from  openrowset(
	bulk 'c:\source\personlistmissingcolumn.csv',  
	formatfile='c:\source\personlistmissingcolumnformatfile.xml',
	firstrow=2
	) as t;
 
-- dynamic load data from bulk to destination
declare @fieldlist varchar(8000);
declare @sql nvarchar(4000);
 
select @fieldlist = 
				stuff((select 
					',' + QUOTENAME(r.column_name)
						from (
							select column_name from INFORMATION_SCHEMA.COLUMNS where TABLE_NAME = 'personlist'
							) r
							join (
								select column_name from INFORMATION_SCHEMA.COLUMNS where TABLE_NAME = 'personlist_bulk'
								) b
								on b.COLUMN_NAME = r.COLUMN_NAME
						for xml path('')),1,1,'');
 
print (@fieldlist);
set @sql = 'truncate table dbo.personlist;' + CHAR(10);
set @sql = @sql + 'insert into dbo.personlist (' + @fieldlist + ')' + CHAR(10);
set @sql = @sql + 'select ' + @fieldlist + ' from dbo.personlist_bulk;';
print (@sql)
exec sp_executesql @sql
 

The result is a TSQL statement what looks like this:

truncate table dbo.personlist;
insert into dbo.personlist ([age],[city],[gender],[name])
select [age],[city],[gender],[name] from dbo.personlist_bulk;

The exact same thing would be able to be used with the other source files in this demo. The result is that the destination table is correct and loaded with the right data every time – and only with the data that corresponds with the source. No errors will be thrown.

From here there are some remarks to be taken into account:

  1. As no errors are thrown, the source files could be empty and the data updated could be blank in the destination table. This is to be handled by processed outside this demo.

Further work

As this demo and post shows it is possible to handle dynamic changing flat source files. Changing columns, column order and other changes, can be handled in an easy way with a few lines of code.

Going from here, a suggestion could be to set up processes that compared the two tables (bulk and destination) and throws an error if X amount of the columns are not present in the bulk table or X amount of columns are new.

It is also possible to auto generate missing columns in the destination table based on columns from the bulk table.

Only your imagination sets the boundaries here.

Summary – importing flatfiles to a SQL server

With this blogpost I hope to have given you inspiration to build your own import structure of flatfiles in those cases where the structure might change.

As seen above the approach needs some .Net skills – but when it is done and the console application has been build, it is a matter of reusing the same application around the different integration solutions in your environment.

Happy coding 🙂

External links:

BULK INSERT from MSDN: https://msdn.microsoft.com/en-us/library/ms188365.aspx

OPENROWSET from MSDN: https://msdn.microsoft.com/en-us/library/ms190312(v=sql.130).aspx

XP_CMDSHELL from MSDN: https://msdn.microsoft.com/en-us/library/ms175046.aspx

GitHub link: https://github.com/brianbonk/GenerateFormatFile/releases/tag/v2.0

Undelete object from database

undelete

Have you ever tried to delete an object from the database by mistake or other error? You can undelete object – sometimes.

Then you should read on in this short post.

I recently came across a good co-worker of mine who lost one of the views on the developer database. He called me for help.

Fortunately the database was in FULL RECOVERY mode – so I could extract the object from the database log and send the script to him for his further work that day. I think I saved him a whole day of work…

The undelete object script

Here is the script I used:

select 
	convert(varchar(max),substring([RowLog Contents 0], 33, LEN([RowLog Contents 0]))) as [Script]
from 
	fn_dblog(NULL,NULL)
where 1=1
	and [Operation]='LOP_DELETE_ROWS' 
	and [Context]='LCX_MARK_AS_GHOST'
and [AllocUnitName]='sys.sysobjvalues.clst'

Ready, SET, go – how does SQL server handle recursive CTE’s

This blogpost will cover some of the basics in recursive CTE’s and explain the approach done by the SQL Server engine.

First of all, a quick recap on what a recursive query is.

Recursive queries are useful when building hierarchies, traverse datasets and generate arbitrary rowsets etc. The recursive part (simply) means joining a rowset with itself an arbitrary number of times.

A recursive query is defined by an anchor set (the base rowset of the recursion) and a recursive part (the operation that should be done over the previous rowset).

The basics in recursive CTE

A recursive query helps in a lot of scenarios. For instance, where a dataset is built as a parent-child relationship and the requirement is to “unfold” this dataset and show the hierarchy in a ragged format.

A recursive CTE has a defined syntax – and can be written in general terms like this – and don’t run way because of the general syntax – a lot of examples (in real code) will come:

select result_from_previous.*
 from result_from_previous
 union all
 select result_from_current.*
 from set_operation(result_from_previous, mytable) as result_from_current

Or rewritten in another way:

select result_from_previous.*
 from result_from_previous
 union all
 select result_from_current.*
 from result_from_previous.*
 join mytable
 on condition(result_from_previous)

Another way to write the query (using cross apply):

select result_from_current.*
from result_from_previous
cross apply (
select result_from_previous.*
union all
select *
from mytable
where condition(result_from_previous.*)
) as result_from_current

The last one – with the cross apply – is row based and a lot slower than the other two. It iterates over every row from the previous result and computes the scalar condition (which returns true or false). The same row then gets compared to each row in mytable and the current row of result_from_previous. When these conditions are real – the query can be rewritten as a join. Why you should not use the cross apply for recursive queries.

The reverse – from join to cross apply – is not always true. To know this, we need to look at the algebra of distributivity.

Distributivity algebra

Most of us have already learned that below mathematics is true:

X x (Y + Z) = (X x Y) + (X x Z)

But below is not always true:

X ^ (Y x Z) = (X ^ Z) x (X ^ Y)

Or said with words, distributivity means that the order of operations is not important. The multiplication can be done after the addition and the addition can be done after the multiplication. The result will be the same no matter what.

This arithmetic can be used to generate the relational algebra – it’s pretty straight forward:

set_operation(A union all B, C) = set_operation(A, C) union all set_operation(B, C)

The condition above is true as with the first condition in the arithmetic.

So the union all over the operations is the same as the operations over the union all. This also implies that you cannot use operators like top, distinct, outer join (more exceptions here). The distribution is not the same between top over union all and union all over top. Microsoft has done a lot of good thinking in the recursive approach to reach one ultimate goal – forbid operators that do not distribute over union all.

With this information and knowledge our baseline for building a recursive CTE is now in place.

The first recursive query

Based on the intro and the above algebra we can now begin to build our first recursive CTE.

Consider a sample rowset (sampletree):

id parentId name
1 NULL Ditlev
2 NULL Claus
3 1 Jane
4 2 John
5 3 Brian

From above we can see that Brian refers to Jane who refers to Ditlev. And John refers to Claus. This is fairly easy to read from this rowset – but what if the hierarchy is more complex and unreadable?

A sample requirement could be to “unfold” the hierarchy in a ragged hierarchy so it is directly readable.

The anchor

We start with the anchor set (Ditlev and Claus). In this dataset the anchor is defined by parentId is null.

This gives us an anchor-query like below:

recursive CTE 1

Now on to the next part.

The recursive

 After the anchor part, we are ready to build the recursive part of the query.

The recursive part is actually the same query with small differences. The main select is the same as the anchor part. We need to make a self join in the select statement for the recursive part.

Before we dive more into the total statement – I’ll show the statement below. Then I’ll run through the details.

recursive CTE 2

Back to the self-reference. Notice the two red underlines in the code. The top one indicates the CTE’s name and the second line indicates the self-reference. This is joined directly in the recursive part in order to do the arithmetic logic in the statement. The join is done between the recursive results parentId and the id in the anchor result. This gives us the possibility to get the name column from the anchor statement.

Notice that I’ve also put in another blank field in the anchor statement and added the parentName field in the recursive statement. This gives us the “human readable” output where I can find the hierarchy directly by reading from left to right.

To get data from the above CTE I just have to make a select statement from this:

recursive CTE 3

And the results:

recursive CTE 4

I can now directly read that Jane refers to Ditlev and Brian refers to Jane.

But how is this done when the SQL engine executes the query – the next part tries to explain that.

The SQL engines handling

Given the full CTE statement above I’ll try to explain what the SQL engine does to handle this.

The documented semantics is as follows:

  1. Split the CTE into anchor and recursive parts
  2. Run the anchor member creating the first base result set (T0)
  3. Run the recursive member with Ti as an input and Ti+1 as an output
  4. Repeat step 3 until an empty result set is returned
  5. Return the result set. This is a union all set of T0 to Tn

So let me try to rewrite the above query to match this sequence.

The anchor statement we already know:

recursive CTE 5

First recursive query:

recursive CTE 6

Second recursive query:

recursive CTE 7

The n recursive query:

The union all statement:

This gives us the exactly same result as we saw before with the rewrite:

Notice that the statement that I’ve put in above named Tn is actually empty. This to give the example of the empty statement that makes the SQL engine stop its execution in the recursive CTE.

This is how I would describe the SQL engines handling of a recursive CTE.

Based on this very simple example, I guess you already can think of ways to use this in your projects and daily tasks.

But what about the performance and execution plan?

Performance

The execution plan for the original recursive CTE looks like this:

The top part of this execution plan is the anchor statement and the bottom part is the recursive statement.

Notice that I haven’t made any indexes in the table, so we are reading on heaps here.

But what if the data is more complex in structure and depth. Let’s try to base the answer on an example:

From the attached sql code you’ll find a script to generate +20.000 rows in a new table called complextree. This data is from a live solution and contains medical procedure names in a hierarchy. The data is used to show the relationships in medical procedures done by the Danish hospital system. It is both deep and complex in structure. (Sorry for the Danish letters in the data…).

When we run a recursive CTE on this data – we get the exactly same execution plan:

This is also what I would expect as the amount of data when read from heaps very seldom impact on the generated execution plan.

The query runs on my PC for 25 seconds.

Now let me put an index in the table and let’s see the performance and execution plan.

The index is only put on the parentDwId as, according to our knowledge from this article is the recursive parts join column.

The query now runs 1 second to completion and generates this execution plan:

The top line is still the anchor and the bottom part is the recursive part. Notice now the SQL engine uses the non-clustered index to perform the execution and the performance gain is noticeable.

Conclusion

I hope that you’ve now become more familiar with the recursive CTE statement and are willing to try it on your own projects and tasks.

The basics is somewhat straight forward – but beware that the query can become complex and hard to debug as the demand for data and output becomes stronger. But don’t be scared. As I always say – “Don’t do a complex query all at once, start small and build it up as you go along”.

Happy coding.

External links:

The with operator in T-SQL: https://technet.microsoft.com/en-us/library/ms175972.aspx

Recursive CTE’s from MSDN: https://msdn.microsoft.com/en-us/library/ms186243.aspx

Wikipedia on distributivity: https://en.wikipedia.org/wiki/Distributive_property

Use of hierarchyid in SQL Server

I attended a TDWI conference in May 2016 in Chicago. Here I got a hint about the datatype hierarchyid in SQL Server which could optimize and eliminate the good old parent/child hierarchy.

Until then I (and several other in the class) haven’t heard about the hierarchyid datatype in SQL Server. So I had to find out and learn this.

Here’s a blogpost covering some of the aspects of the datatype hierarchyid – including:

  • Introduction
  • How to use it
  • How to optimize data in the table
  • How to work with data in the hierarchy-structure
  • Goodies

Introduction

The datatype hierarchyid was introduced in the SQL Server as from version 2008. It is a variable length system datatype. The datatype can be used to represent a given element’s position in a hierarchy – e.g. an employee’s position within an organization.

The datatype is extremely compact. The storage is dependent in the average fanout (fanout = the number of children in all nodes). For smaller fanouts (0-7) the typical storage is about 6 x Log A * n bits. Where A is the average fanout and n in the total number of nodes in the tree. Given above formula an organization with 100,000 employees and a fanout of 6 levels will take around 38 bits – rounded to 5 bytes of total storage for the hierarchy structure.

Though the limitation of the datatype is 892 bytes there is a lot of room for extremely complex and deep structures.

When representing the values to and from the hierarchyid datatype the syntax is:

[level id 1]/[level id 2]/..[level id n]

Example:

1/7/3

The data between the ‘/ can be of decimal types e.g. 0.1, 2.3 etc.

Given two specific levels in the hierarchy a and b given that a < b means that b comes after a in a depth first order of comparison traversing the tree structure. Any search and comparison on the tree is done this way by the SQL engine.

The datatype directly supports deletions and inserts through the GetDescendant method (see later for full list of methods using this feature). This method enables generation of siblings to the right of any given node and to the left of any given node. Even between two siblings. NOTE: when inserting a new node between two siblings will produce values that are slightly less compact.

Hierarchyid in SQL Server how to use it

Given an example of data – see compete sql script at the end of this post to generate the example used in this post.

hierarchyid in SQL Server 1

The Num field is a simple ascending counter for each level member in the hierarchy.

There are some basic methods to be used in order to build the hierarchy using the hierarchy datatype.

GetRoot method

The GetRoot method gives the hierarchyid of the rootnode in the hierarchy. Represented by the EmployeeId 1 in above example.

The code and result could look like this:

hierarchyid in SQL Server 2

The value ‘0x’ from the OrgPath field is the representation of the string ‘/’ giving the root of the hierarchy. This can be seen using a simple cast to varchar statement:

hierarchyid in SQL Server 3

Building the new structure with the hierarchyid dataype using a recursive SQL statement:

hierarchyid in SQL Server 4

Notice the building of the path after the union all. This complies to the above mentioned syntax for building the hierarchy structure to convert to a hierarchyid datatype.

If I was to build the path for the EmployeeId 10 (Name = ‘Mads’) in above example it would look like this: ‘/2/2/’. A select statement converting the hierarchyid field OrgPath for the same record, reveals the same thing:

hierarchyid in SQL Server 5

Notice the use of the ToString method here. Another build in method to use for the hierarchyid in SQL Server.

GetLevel method

The GetLevel method returns the current nodes level with an index of 0 from the top:

hierarchyid in SQL Server 6

GetDescendant method

This method returns a new hierarchyid based on the two parameters child1 and child2.

The use of these parameters is described in the BOL HERE.

Below is showed some short examples on the usage.

Getting a new hierarchyid when a new employee referring to top manager is hired:

hierarchyid in SQL Server 7

Getting a new hierarchyid when a new hire is referring to Jane on the hierarchy:

hierarchyid in SQL Server 8

Dynamic insert new records in the hierarchy table – this can easily be converted into a stored procedure:

hierarchyid in SQL Server 9

Notice the new GetAncestor method which takes one variable (the number of steps up the hierarchy) and returns that levels Hierarchyid. In this case just 1 step up the hierarchy.

More methods

There are several more methods to use when working on a hierarchy table – as found on BOL:

GetDescendant – returns a new child node of a given parent. Takes to parameters.

GetLevel – returns the given level for a node (0 index)

GetRoot – returns a root member

ToString – converts a hierarchyid datatype to readable string

IsDescendantOf – returns boolean telling if a given node is a descendant of given parent

Parse – converts a string to a hierarchyid

Read – is used implicit in the ToString method. Cannot be called by the T-SQL statement

GetParentedValue – returns node from new root in case of moving a given node

Write – returns a binary representation of the hierarchyid. Cannot be called by the T-SQL statement.

Optimization

As in many other scenarios of the SQL Server the usual approach to indexing and optimization can be used.

To help on the usual and most used queries I would make below two indexes on the example table:

hierarchyid in SQL Server 10

But with this like with any other indexing strategy – base it on the given scenario and usage.

Goodies

So why use this feature and all the coding work that comes with it?

Well – from my perspective – it has just become very easy to quickly get all elements either up or down from a given node in the hierarchy.

Get all descendants from a specific node

If I would like to get all elements below Jane in the hierarchy I just have to run this command:

hierarchyid in SQL Server 11

Think of the work you would have to do if this was a non hierarchy structured table using only parent/child and recursice sql if the structure was very complex and deep.

I know what I would choose.

Conclusion

As seen above the datatype hierarchyid can be used to give order to the structure of a hierarchy in a way that is both efficient and fairly easy maintained.

If one should optimize the structure even further, then the EmployeeId and the ManagerId could be dropped as the EmployeeId is now as distinct as the OrgPath and can be replaced by this. The ManagerId is only used to build the structure – but this is now also given by the OrgPath.

Happy coding…

External references:

Hierarchyid from MSDN

Using hierarchyid from TechNet

Update SCD Type 2 dimension in one single transaction using only T-SQL

Merge logo new

Recently I got a request inside my organization to make sure that a SCD Type 2 dimension would keep track of the changes due to requrementes from the business.

This needed to be done in a single transaction in pure T-SQL code.

So – what to do and how to do it. Here’s one way.

The sourcetable looks like this:

SCD type 2 dimension

The request was to keep track of changes in the ManagerId according to CaseId.

I’ve created a SCD2 table like this:

CREATE TABLE [dbo].[CaseProjectManagerHistory](
	[dwid] [bigint] IDENTITY(1,1) NOT NULL,
	[CaseId] [int] NULL,
	[ManagerId] [int] NULL,
	[dwDateFrom] [date] NULL,
	[dwDateTo] [date] NULL,
	[dwIsCurrent] [bit] NULL,
	[dwChangeDate] [date] NULL
)


The fields are as follows:
dwid: Identifier for the table
CaseId: The caseid for the rows
ManagerId: The managerid for the row
dwDateFrom: The date from where the row is actual
dwDateTo: The date to where the row is actual
dwIsCurrent: Boolean that tells if the row is the current one or not
dwChangeDate: The date of the change (if the row has changed since the first write)

If you need to catch up on the history types in a dimension – then take a look at Kennie’s blogppost HERE.

First of all I started out with a merge statement that would insert all the new values not in the table and update the ones that needed update.

Something like this:

merge dbo.CaseProjectManagerHistory as target
	using (select CaseId, ManagerId, cast(getdate() as date) as startDate, datefromparts(2199,1,1) as endDate, 1 as [current], cast(getdate() as date) as changeDate from dbo.[Case]) as source
	on target.CaseId = source.CaseId
	when not matched by target
		then
			insert (CaseId, ManagerId, dwDateFrom, dwDateTo, dwIsCurrent, dwChangeDate)
			values (source.CaseId, source.ManagerId, source.startDate, source.endDate, source.[current], source.changeDate)
	when matched 
		and target.dwIsCurrent = 1
		and exists (select source.CaseId, source.ManagerId
					except
					select target.CaseId, target.ManagerId)
		and target.dwChangeDate <= source.ChangeDate
		and source.changeDate < target.dwDateTo
		then
			update set dwIsCurrent = 0, target.dwChangeDate = source.changeDate, target.dwDateTo = dateadd(d,-1,source.startDate)


Those of you who haven’t tried and worked with a merge-statement – you can get the 101 from BOL here.

But this merge statement only inserts new rows and updates existing rows. The rows that are updated still needs to be in the table in order to fully apply to the SCD 2 rules.

This can be done by using the cluse ‘output’ from the merge-statement and then use the output rows to insert into the same table.

It will look like this:

insert into dbo.CaseProjectManagerHistory_demo (CaseId, ManagerId, dwDateFrom, dwDateTo, dwIsCurrent, dwChangeDate)
select CaseId, ManagerId, startDate, endDate, [current], changeDate 
from (
	merge dbo.CaseProjectManagerHistory_demo as target
	using (
		select 
			CaseId
			,ManagerId
			,cast(getdate() as date) as startDate
			,datefromparts(2199,1,1) as endDate
			,1 as [current]
			,cast(getdate() as date) as changeDate 
		from 
			dbo.[Case]
		where 1=1
			and caseid in (2005,2013,2015,2016,2019,2021,2023,2025,2027,2028)
			) as source
	on target.CaseId = source.CaseId
	when not matched by target -- indsæt nye rækker
		then
			insert (CaseId, ManagerId, dwDateFrom, dwDateTo, dwIsCurrent, dwChangeDate)
			values (source.CaseId, source.ManagerId, source.startDate, source.endDate, source.[current], source.changeDate)
	when matched -- opdater eksisterende rækker
		and target.dwIsCurrent = 1
		and exists (select source.CaseId, source.ManagerId --filtrer kun på rækker der ikke allerede eksisterer i target
					except
					select target.CaseId, target.ManagerId)
		and target.dwChangeDate <= source.ChangeDate
		and source.changeDate < target.dwDateTo
		then
		update set dwIsCurrent = 0, target.dwChangeDate = source.changeDate, target.dwDateTo = dateadd(d,-1,source.startDate) 
		output $action ActionOut, source.CaseId, source.ManagerId, source.startDate, source.endDate, source.changeDate, source.[current]) as mergeOutput
where mergeOutput.ActionOut = 'UPDATE';

The mergestatement ‘output’ action is used to insert the same rows to the history table once more. The only change is the ‘end date’.

Happy coding!

Note: I did a short presentation with this at my workplace a few weeks ago, and here Kennie (l, b, t) told me that there is a bug in the merge statement that needs to be taken into account. Read more of that here.

Row level security in SQL Server 2016

With the release of SQL Server 2016 comes many great new features. One of these is the implementation of row level security in the database engine.

This blogpost will cover the aspects of this new feature – including:

  • Setup
  • Best practice
  • Performance
  • Possible security leaks

Introduction

The row level security feature was released earlier this year to Azure – following Microsoft’s cloud-first release concept.

A past big issue with the SQL Server engine was that in only understands tables and columns. Then you had to simulate security using secured views, stored procedures or table value functions. The problem here was to make sure that there were no way to bypass them.

With SQL Server 2016, this is no longer an issue.

Now the SQL Server engine handles the security policy in a central controlled area.

row level security 1

Setup and best practice

The Row-level security is based on a special inline table valued function. This function returns either a single row with a 1 or no rows based on the users rights to that specific row.

Let us take an example:

First of all, I’ll create a database and some users to test with:

CREATE DATABASE RowFilter;
GO
 
USE RowFilter;
GO
 
CREATE USER userBrian WITHOUT LOGIN;
CREATE USER userJames WITHOUT LOGIN;
GO

 A table with examples and grant select to the new users:

CREATE TABLE dbo.SalesFigures (
[userCode] NVARCHAR(10),
[sales] MONEY)
GO
 
INSERT  INTO dbo.SalesFigures
VALUES ('userBrian',100), ('userJames',250), ('userBrian',350)
GO
 
GRANT SELECT ON dbo.SalesFigures TO userBrian
GRANT SELECT ON dbo.SalesFigures TO userJames
GO


Now we’ll add a filter predicate function as below:

CREATE FUNCTION dbo.rowLevelPredicate (@userCode as sysname)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN SELECT 1 AS rowLevelPredicateResult
WHERE @userCode = USER_NAME();
GO


This illustrates that the current user must have associated records in order to get any results. Notice that the functions does not have access to the rows itself.

Furthermore the function can contain joins and lookup tables in the where clause – but beware of the performance hit here. Look further down this post for more info.

The last thing to do is to add a filter predicate to the table dbo.SalesFigures:

CREATE SECURITY POLICY UserFilter
ADD FILTER PREDICATE dbo.rowLevelPredicate(userCode)
ON dbo.SalesFigures
WITH (STATE = ON);
GO


That’s it.

Let’s test the results with the users added before:

 EXECUTE AS USER = 'userBrian';
SELECT * FROM dbo.SalesFigures;
REVERT;
GO


This gives me 2 rows:

EXECUTE AS USER = 'userJames';
SELECT * FROM dbo.SalesFigures;
REVERT;
GO
 


This gives me 1 row:

The execution plan shows a new filter predicate when this row level security is added:

To clean up the examples.

USE master;
DROP DATABASE RowFilter;


Performance

Some might ask, “what about the performance – isn’t there a performance hit in this use of functions?”

The short answer is “It depends”.

If you only use a direct filter on the table there is very little to no impact on the performance. The filter is applied directly to the table as any other filter. Compared to the old way of doing the row filter with stored procedures or table valued functions this new approach is performing better.

If you plan to use lookup tables or joins in the predicate function, then you must beware of the helper tables’ indexes and how fast they can deliver data to the function. If the tables are large and slow performing (without indexes etc.) then you will experience bad performance in the row filter function. But that’s just like any other lookup or join that you might do in your solutions.

Best practices

There are some best practices given from Microsoft:

  • It is highly recommended to create a separate schema for the RLS objects (predicate function and security policy).
  • The ALTER ANY SECURITY POLICY permission is intended for highly-privileged users (such as a security policy manager). The security policy manager does not require SELECT permission on the tables they protect.
  • Avoid type conversions in predicate functions to avoid potential runtime errors.
  • Avoid recursion in predicate functions wherever possible to avoid performance degradation. The query optimizer will try to detect direct recursions, but is not guaranteed to find indirect recursions (i.e., where a second function calls the predicate function).
  • Avoid using excessive table joins in predicate functions to maximize performance.

Possible security leaks

This new row filter context can cause information leakage using some carefully codes queries.

Above example can be breached with the following query:

SELECT 1/([sales]-250) FROM dbo.SalesFigures
WHERE Usercode = 'userJames'

This will give an error: Divide by zero error encountered.


This will tell the user trying to access the table, that userJames has a sale of 250. So even though the row filter prevents users from accessing data that they are not allowed, hackers can still try to determine the data in the table using above method.

Conclusion

The new row level security feature has been very much a wanted feature for quite a while now, and with the function now in place, and planned to be released in the RTM version of SQL Server 2016, the DBA’s and other people working with security can use this out-of-the-box.

I hope this post makes a great start for you if you would like to try out the row level security function. Currently the feature is awailable in the latest CTP version (2.2) – which can be downloaded here: https://www.microsoft.com/en-us/evalcenter/evaluate-sql-server-2016

en_USEnglish