In the last week I have introduced heap tables to you. As we have said, a table in SQL Server can be a Heap Table or a Clustered Table - a table with a Clustered Index defined on it. And today we will have a more detailed look on Clustered Indexes, and how to choose the right Clustered Key.
Normally you can assume that your tables have already a Clustered Index defined, because every time when you create a Primary Key constraint in SQL Server, the constraint is (by default) enforced through a Unique Clustered Index. This means that you need to have in that column (or these columns, if you have defined a composite Primary Key) unique values. If you don’t have a Primary Key constraint on your table, you can manually create a Clustered Index through the CREATE CLUSTERED INDEX T-SQL statement and specify the Clustered Key column(s).
As soon as your table has a Clustered Index defined, your table data is physically sorted by the Clustered Key column(s). Let’s have a look on the advantages and disadvantages of Clustered Indexes in SQL Server.
One of the biggest advantages of Clustered Tables is that the data is physically sorted by the Clustered Key in your storage subsystem. You can compare a Clustered Table to a traditional phone book: the phone book is clustered/sorted by last name, means the last name Aschenbrenner comes before Bauer, and Bauer comes before Meyer. Clustered Tables are therefore differ completely from Heap Tables, where you have no physical sorting order.
That’s a really huge benefit you are getting from Clustered Tables. Imagine you are searching for a specific record in the table, and the column in the WHERE clause on which you are restricting your data, is the Clustered Key. In that case SQL Server will choose in the execution plan an Clustered Index Seek operator. Seek operations are very, very efficient, because SQL Server is using here a B-tree structure to find the relevant records. The complexity of a seek operation is always O(log N). If you want to learn more about the internal used B-tree structure, you can also watch my SQL Server Quickie on that topic. Back in 2010 I have also written multiple blog postings about that topic.
It’s the same when you are searching in a phone book for the name Aschenbrenner. You know that this name can be only at the beginning of the phone book, because the phone book is sorted by this data. So you can avoid a scan of the complete phone book, and SQL Server can avoid a complete scan of the Clustered Index in the leaf level.
As long as you have no index fragmentation in your Clustered Index, you will also get sequential I/O when you are accessing the Clustered Index with a scan operation. Index Fragmentation would mean that the logical and physical sorting order of your pages in the leaf level are not the same. You can fix index fragmentation through Index Rebuild and Index Reorganize operations. We will talk more about the difference between both methods in week 24, when we cover database maintenance.
If you have index fragmentation, or not, depends on your chosen Clustered Key column. As long as you are using an ever increasing value (like INT IDENTITY, an OrderDate column), records are inserted at the end of the Clustered Index. This means that you are not introducing a fragmentation in your index, because SQL Server is only appening data at the end of your Clustered Index. But in some rare cases this approach will not scale indefinitely. Therefore we also have to talk now about the disadvantages that you can have with Clustered Indexes, and a wrong chosen Clustered Key.
Data inserted only at the end of the Clustered Index can introduce the so-called Last Page Insert Latch Contention, because you have a single hotspot at the end of your Clustered Index, where multiple queries compete against each other, when traversing through the B-tree structure. The following picture shows this phenomen.
To overcome this problem you can choose a random Clustered Key for your Clustered Index, because then you are distributing the inserted records across multiple different places in the Clustered Index. But a random Clustered Key introduces so-called hard Page Splits, because SQL Server has to allocate new data pages somewhere within the leaf level of the Clustered Index. Hard page splits will also have a negative impact on the performance of your transaction log, because logging a hard page split is a lot more work than logging a “normal” INSERT (a so-called soft Page Split) at the end of your Clustered Index.
As a side-effect you are also introducing index fragmentation with a random Clustered Key, because your logical and physical sorting order are not the same anymore. Random I/O will kill on traditional rotational storage the performance of your scan operations, because the disk head must move forward and backward on the platters of your drive, when reading the various data pages.
Clustered Indexes are scaling very well, because they use internally a B-tree structure. SQL Server can make an effective use of this structure when performing index seek operations on your table. But choosing the right and correct Clustered Key is a time-consuming job, because you have to think about all the advantages and disadvantages of every scenario (ever increasing value vs. random value).
Next week I will talk more about Non-Clustered Indexes in SQL Server. You will learn what a Non-Clustered Index is, and what dependency a Non-Clustered Index has on the Clustered Key defined on the Clustered Index. So enjoy your next 7 days, and then we will meet each other again.