SQL Server Primary Key vs. Clustered Index: Part 1

Note: This is Part 1 of a three-part article on SQL Server primary keys and clustered indexes.

Many SQL Server data architects and developers have been strong proponents of letting an identity column define the primary key for large tables. Why not? This article examines the pros and cons.

Should the definition of a primary key define the clustered index?

The identity property is always increasing and newly inserted rows are always inserted at the end of the table,  thus resembling (but not always) the properties of a CreateDate column. Old data is then at one end of the table and new data at the other end of the table.

One of the cons to this approach is that you may encounter hot spots at the new data end of the table, but I’ve never encountered this issue. Another plus for using the identity property is that it is an integer and keeps foreign key relationships and indexes to a relatively small size.

With this approach I’ve always let the definition of the primary key define the SQL Server clustered index:

where the default is to create a clustered index.

However, while researching how best to define a partition key (for large tables – further discussed in Part 3 of this article), it finally dawned on me that perhaps it might be better to break apart the Primary Key (i.e., nonclustered) and define a separate unique clustered index based on the Primary Key and a natural key (something like CreateDate). Many of the stored procedures and ad hoc queries I’ve encountered in the business world have some dependency on CreateDate and although the identity column is monotonically increasing it has no relationship with the CreateDate column.

Consider an example

Consider the following table definition:

With the exception perhaps of the computed column CreateDateKey this is a typical table definition using the identity property as both the SQL Server Primary Key and the Clustered Index. In the instance the CreateDateKey is an integer of the form yyyymmdd.

Consider using the CreateDateKey (as defined above) and the Primary Key for the clustered index for the table instead of the original definition:

1. Need to persist CreateDateKey to use as Clustered index–this causes a schema lock

2. Delete old PK

3. Generate unique clustered index on CreateDateKey,PackageID–this does cause schema locks

4. Rebuild PK

The #1 TSQL statements drops the original computed column, and then recreates it with the PERSISTED property. This is necessary for the definition of the SQL Server clustered index.

The #2 TSQL statement drops the original Primary Key. It will fail if there are any other tables with foreign key relationships to dbo.Package. (this will be discussed later in a future article with a stored procedure to identify and generate drop and create statements on any existing dependent foreign keys.)

The #3 TSQL statement generates a unique, clustered index based on the computed column CreateDateKey and the original SQL Server Primary Key (identity column). Obviously this is unique since the original Primary Key is unique. Uniqueness is a property that helps the optimizer pick a better query plan. Since query plan generation looks at the selectivity of only the first column in a multi-column index, I picked the CreateDateKey first so the optimizer might select a seek (rather than a scan) when CreateDateKey is a parameter in the WHERE clause. The second line can only be used for Enterprise (or higher) editions of SQL Server 2008 R2 and higher. The ONLINE property allows the index to be created, but still giving access to other users to the table data. ONLINE is slightly slower than OFFLINE (default). The row compression property will be discussed in a future article. This line can be eliminated in the standard or express editions of SQL Server.

The #4 TSQL statement re-generates the Primary Key but now it is non-clustered. Any dependent Foreign Keys can then be re-created with their original definitions.

Part 2 of this article will discuss differences in query plans between the original table and the revised table with some benchmarking numbers for typical scenarios. The results were better than expected – stay tuned.

  • RDW2

    I have gotten into some extended (and often heated) “discussions” regarding whether to use Identity columns as PK’s or, for that matter, at all. Keys that are actually part of the data being stored in tables and used for purposes other than strictly as identifiers are usually termed “Natural Keys” while others, such as Identity or GUID columns, are termed “surrogate” Keys and some really big names in the SQL Server community are horrified by the thought of using a Surrogate Key for your PK . . . which I generally find to be quite humorous. ;-)

    Besides the obvious point that you have made that Identities are usually INT datatypes and smaller than, for instance names or other Natural Keys, I have never found a Natural Key that is absoultely NEVER going to get modified.

    So far, I like where you seem to be going with this series, especially the use of the Identity column along with another column to establish the uniqueness of the multi-column index.

  • MByrd

    Thanks for your interest. I am had several long discussion with Joe Celko here in Austin about identity columns as PKs versus natural keys. My interest is performance and anything that can improve performance is on my side. Smaller, narrower indexes are our friends. Hope you enjoy the rest of the article.

  • Pingback: SQL Server Primary Key vs Clustered Index, Part 3()

  • Pingback: SQL Server Primary Key vs. Clustered Index: Part 2()