Have you ever tried to delete an object from the database by mistake or other error? You can undelete object – sometimes.
Then you should read on in this short post.
I recently came across a good co-worker of mine who lost one of the views on the developer database. He called me for help.
Fortunately the database was in FULL RECOVERY mode – so I could extract the object from the database log and send the script to him for his further work that day. I think I saved him a whole day of work…
The undelete object script
Here is the script I used:
select
convert(varchar(max),substring([RowLog Contents 0], 33, LEN([RowLog Contents 0]))) as [Script]
from
fn_dblog(NULL,NULL)
where 1=1
and [Operation]='LOP_DELETE_ROWS'
and [Context]='LCX_MARK_AS_GHOST'
and [AllocUnitName]='sys.sysobjvalues.clst'
This
blogpost will cover some of the basics in recursive CTE’s and explain the
approach done by the SQL Server engine.
First of
all, a quick recap on what a recursive query is.
Recursive
queries are useful when building hierarchies, traverse datasets and generate
arbitrary rowsets etc. The recursive part (simply) means joining a rowset with
itself an arbitrary number of times.
A recursive
query is defined by an anchor set (the base rowset of the recursion) and a
recursive part (the operation that should be done over the previous rowset).
The basics in recursive CTE
A recursive
query helps in a lot of scenarios. For instance, where a dataset is built as a
parent-child relationship and the requirement is to “unfold” this dataset and
show the hierarchy in a ragged format.
A recursive
CTE has a defined syntax – and can be written in general terms like this – and
don’t run way because of the general syntax – a lot of examples (in real code)
will come:
select result_from_previous.*
from result_from_previous
union all
select result_from_current.*
from set_operation(result_from_previous, mytable) as result_from_current
Or
rewritten in another way:
select result_from_previous.*
from result_from_previous
union all
select result_from_current.*
from result_from_previous.*
join mytable
on condition(result_from_previous)
Another way
to write the query (using cross apply):
select result_from_current.*
from result_from_previous
cross apply (
select result_from_previous.*
union all
select *
from mytable
where condition(result_from_previous.*)
) as result_from_current
The last
one – with the cross apply – is row based and a lot slower than the other two.
It iterates over every row from the previous result and computes the scalar
condition (which returns true or false). The same row then gets compared to
each row in mytable and the current row of result_from_previous. When these
conditions are real – the query can be rewritten as a join. Why you should not
use the cross apply for recursive queries.
The reverse
– from join to cross apply – is not always true. To know this, we need to look
at the algebra of distributivity.
Distributivity algebra
Most of us
have already learned that below mathematics is true:
X x (Y + Z) = (X x Y) + (X x Z)
But below
is not always true:
X ^ (Y x Z) = (X ^ Z) x (X ^ Y)
Or said
with words, distributivity means that the order of operations is not important.
The multiplication can be done after the addition and the addition can be done
after the multiplication. The result will be the same no matter what.
This
arithmetic can be used to generate the relational algebra – it’s pretty
straight forward:
set_operation(A union all B, C) = set_operation(A, C) union all set_operation(B, C)
The
condition above is true as with the first condition in the arithmetic.
So the
union all over the operations is the same as the operations over the union all.
This also implies that you cannot use operators like top, distinct, outer join
(more exceptions here). The distribution is not the same
between top over union all and union all over top. Microsoft has done a lot of
good thinking in the recursive approach to reach one ultimate goal – forbid operators
that do not distribute over union all.
With this
information and knowledge our baseline for building a recursive CTE is now in
place.
The first recursive query
Based on
the intro and the above algebra we can now begin to build our first recursive
CTE.
Consider a
sample rowset (sampletree):
id
parentId
name
1
NULL
Ditlev
2
NULL
Claus
3
1
Jane
4
2
John
5
3
Brian
From above
we can see that Brian refers to Jane who refers to Ditlev. And John refers to
Claus. This is fairly easy to read from this rowset – but what if the hierarchy
is more complex and unreadable?
A sample
requirement could be to “unfold” the hierarchy in a ragged hierarchy so it is
directly readable.
The anchor
We start
with the anchor set (Ditlev and Claus). In this dataset the anchor is defined
by parentId is null.
This gives
us an anchor-query like below:
Now on to
the next part.
The recursive
After the anchor part, we are ready to build
the recursive part of the query.
The
recursive part is actually the same query with small differences. The main
select is the same as the anchor part. We need to make a self join in the
select statement for the recursive part.
Before we
dive more into the total statement – I’ll show the statement below. Then I’ll
run through the details.
Back to the
self-reference. Notice the two red underlines in the code. The top one
indicates the CTE’s name and the second line indicates the self-reference. This
is joined directly in the recursive part in order to do the arithmetic logic in
the statement. The join is done between the recursive results parentId and the
id in the anchor result. This gives us the possibility to get the name column
from the anchor statement.
Notice that
I’ve also put in another blank field in the anchor statement and added the
parentName field in the recursive statement. This gives us the “human readable”
output where I can find the hierarchy directly by reading from left to right.
To get data
from the above CTE I just have to make a select statement from this:
And the
results:
I can now
directly read that Jane refers to Ditlev and Brian refers to Jane.
But how is
this done when the SQL engine executes the query – the next part tries to
explain that.
The SQL engines handling
Given the
full CTE statement above I’ll try to explain what the SQL engine does to handle
this.
Run the anchor member creating the
first base result set (T0)
Run the recursive member with Ti
as an input and Ti+1 as an output
Repeat step 3 until an empty result
set is returned
Return the result set. This is a
union all set of T0 to Tn
So let me try
to rewrite the above query to match this sequence.
The anchor
statement we already know:
First
recursive query:
Second
recursive query:
The n
recursive query:
The union
all statement:
This gives
us the exactly same result as we saw before with the rewrite:
Notice that
the statement that I’ve put in above named Tn is actually empty. This to give
the example of the empty statement that makes the SQL engine stop its execution
in the recursive CTE.
This is how
I would describe the SQL engines handling of a recursive CTE.
Based on
this very simple example, I guess you already can think of ways to use this in
your projects and daily tasks.
But what
about the performance and execution plan?
Performance
The
execution plan for the original recursive CTE looks like this:
The top
part of this execution plan is the anchor statement and the bottom part is the
recursive statement.
Notice that
I haven’t made any indexes in the table, so we are reading on heaps here.
But what if
the data is more complex in structure and depth. Let’s try to base the answer
on an example:
From the attached sql code you’ll find a script to generate +20.000 rows in a new table called complextree. This data is from a live solution and contains medical procedure names in a hierarchy. The data is used to show the relationships in medical procedures done by the Danish hospital system. It is both deep and complex in structure. (Sorry for the Danish letters in the data…).
When we run
a recursive CTE on this data – we get the exactly same execution plan:
This is
also what I would expect as the amount of data when read from heaps very seldom
impact on the generated execution plan.
The query
runs on my PC for 25 seconds.
Now let me
put an index in the table and let’s see the performance and execution plan.
The index
is only put on the parentDwId as, according to our knowledge from this article
is the recursive parts join column.
The query
now runs 1 second to completion and generates this execution plan:
The top
line is still the anchor and the bottom part is the recursive part. Notice now
the SQL engine uses the non-clustered index to perform the execution and the
performance gain is noticeable.
Conclusion
I hope that
you’ve now become more familiar with the recursive CTE statement and are
willing to try it on your own projects and tasks.
The basics
is somewhat straight forward – but beware that the query can become complex and
hard to debug as the demand for data and output becomes stronger. But don’t be
scared. As I always say – “Don’t do a complex query all at once, start small
and build it up as you go along”.
I attended
a TDWI conference in May 2016 in Chicago. Here I got a hint about the datatype
hierarchyid in SQL Server which could optimize and eliminate the good old
parent/child hierarchy.
Until then
I (and several other in the class) haven’t heard about the hierarchyid datatype
in SQL Server. So I had to find out and learn this.
Here’s a
blogpost covering some of the aspects of the datatype hierarchyid – including:
Introduction
How
to use it
How
to optimize data in the table
How
to work with data in the hierarchy-structure
Goodies
Introduction
The
datatype hierarchyid was introduced in the SQL Server as from version 2008. It
is a variable length system datatype. The datatype can be used to represent a
given element’s position in a hierarchy – e.g. an employee’s position within an
organization.
The
datatype is extremely compact. The storage is dependent in the average fanout
(fanout = the number of children in all nodes). For smaller fanouts (0-7) the
typical storage is about 6 x Log A * n bits. Where A is the average fanout and
n in the total number of nodes in the tree. Given above formula an organization
with 100,000 employees and a fanout of 6 levels will take around 38 bits –
rounded to 5 bytes of total storage for the hierarchy structure.
Though the
limitation of the datatype is 892 bytes there is a lot of room for extremely
complex and deep structures.
When
representing the values to and from the hierarchyid datatype the syntax is:
[level id 1]/[level id 2]/..[level id n]
Example:
1/7/3
The data
between the ‘/ can be of decimal types e.g. 0.1, 2.3 etc.
Given two
specific levels in the hierarchy a and b given that a < b means that b comes
after a in a depth first order of comparison traversing the tree structure. Any
search and comparison on the tree is done this way by the SQL engine.
The
datatype directly supports deletions and inserts through the GetDescendant
method (see later for full list of methods using this feature). This method
enables generation of siblings to the right of any given node and to the left
of any given node. Even between two siblings. NOTE: when inserting a new node
between two siblings will produce values that are slightly less compact.
Hierarchyid in SQL Server how to use it
Given an
example of data – see compete sql script at the end of this post to generate
the example used in this post.
The Num
field is a simple ascending counter for each level member in the hierarchy.
There are
some basic methods to be used in order to build the hierarchy using the hierarchy
datatype.
GetRoot method
The GetRoot method gives the hierarchyid of the rootnode
in the hierarchy. Represented by the EmployeeId 1 in above example.
The code
and result could look like this:
The value
‘0x’ from the OrgPath field is the representation of the string ‘/’ giving the
root of the hierarchy. This can be seen using a simple cast to varchar
statement:
Building
the new structure with the hierarchyid dataype using a recursive SQL statement:
Notice the
building of the path after the union all. This complies to the above mentioned
syntax for building the hierarchy structure to convert to a hierarchyid
datatype.
If I was to
build the path for the EmployeeId 10 (Name = ‘Mads’) in above example it would
look like this: ‘/2/2/’. A select statement converting the hierarchyid field
OrgPath for the same record, reveals the same thing:
Notice the
use of the ToString method here. Another build in method to use for the
hierarchyid in SQL Server.
GetLevel method
The GetLevel method returns the current nodes level with an
index of 0 from the top:
GetDescendant method
This method
returns a new hierarchyid based on the two parameters child1 and child2.
The use of
these parameters is described in the BOL HERE.
Below is
showed some short examples on the usage.
Getting a
new hierarchyid when a new employee referring to top manager is hired:
Getting a
new hierarchyid when a new hire is referring to Jane on the hierarchy:
Dynamic
insert new records in the hierarchy table – this can easily be converted into a
stored procedure:
Notice the
new GetAncestor method which takes one variable (the number of steps up the
hierarchy) and returns that levels Hierarchyid. In this case just 1 step up the
hierarchy.
More
methods
There are
several more methods to use when working on a hierarchy table – as found on
BOL:
GetDescendant – returns a new child node of a given parent.
Takes to parameters.
GetLevel – returns the given level for a node (0 index)
Read – is used implicit in the ToString method. Cannot
be called by the T-SQL statement
GetParentedValue – returns node from new root in case of moving
a given node
Write – returns a binary representation of the
hierarchyid. Cannot be called by the T-SQL statement.
Optimization
As in many
other scenarios of the SQL Server the usual approach to indexing and
optimization can be used.
To help on
the usual and most used queries I would make below two indexes on the example
table:
But with
this like with any other indexing strategy – base it on the given scenario and
usage.
Goodies
So why use
this feature and all the coding work that comes with it?
Well – from
my perspective – it has just become very easy to quickly get all elements
either up or down from a given node in the hierarchy.
Get all descendants from a
specific node
If I would
like to get all elements below Jane in the hierarchy I just have to run this
command:
Think of
the work you would have to do if this was a non hierarchy structured table
using only parent/child and recursice sql if the structure was very complex and
deep.
I know what
I would choose.
Conclusion
As seen
above the datatype hierarchyid can be used to give order to the structure of a
hierarchy in a way that is both efficient and fairly easy maintained.
If one
should optimize the structure even further, then the EmployeeId and the
ManagerId could be dropped as the EmployeeId is now as distinct as the OrgPath
and can be replaced by this. The ManagerId is only used to build the structure
– but this is now also given by the OrgPath.