The identity value in relational databases is a field that increases automatically. It is often used to create surrogate primary keys.
Surrogate keys are meaningless and are only used to uniquely identify the row, not the data itself. For example, assume the following table:
CREATE TABLE people ( id INT IDENTITY(1,1) PRIMARY KEY CLUSTERED, first_name VARCHAR(200), second_name VARCHAR(200) )
+----+------------+-------------+ | id | first_name | second_name | +----+------------+-------------+ | 1 | Marcin | Gminski | | 2 | Marcin | Gminski | +----+------------+-------------+
As we can see, the identity values are 1, 2 and satisfy the uniqueness of the Primary Key. In such a case, the identity does not prevent duplicate data being inserted into the table. If we wanted to prevent duplicated persons, we would have to create a unique index on
second_name columns but that’s a separate discussion.
So, if the identity fields are meaningless, why do they often start with 1?
Capacity of data types
In our example, the
id field has been defined with the integer data type
INT. An integer is 4 bytes long which means it can accommodate up to 4294967296 values and thus records. However, important to note that the range is from -2147483648 to 2147483647 and therefore the highest (max) number is 2147483647 and not 4294967296.
With this in mind, if we start with 1, the max number of records we can have is 2147483647, exactly half of the integer capacity.
In order to accommodate more records, we would often change the
INT type to
BIGINT is 8 bytes long which could drastically increase the size of the table. We could simply start the identity from the negative value which would give us twice the capacity without increasing the storage utilisation beyond the actual data.
The physical database design is an important factor when building a database. In SQL Server, ROW and PAGE compression will compress smaller values better:
Uses only the bytes that are needed. For example, if a value can be stored in 1 byte, storage will take only 1 byte.
This means that if we start identity with 1 we are going to have a better compression from day 1 on the smaller values. The compression efficiency will decrease as values increase. If we start the identity from the negative value, the compression will be inefficient at the beginning getting better towards the zero and then worsen again.
For example, if we know that our table will not exceed 1 billion rows we can safely start identity from 1 and benefit from more efficient data compression.
This is something I have heard many times:
“We cannot use negative IDs because the application does not handle it”.Developer
This suggests to me that you are misusing surrogate keys. Your application should not rely on surrogate keys, which are by definition meaningless. If your application would break with negative keys, it means they are not meaningless.
When to use negative values
There is no set rule and no right or wrong. A common scenario would be to increase the capacity of the INT field without having to migrate to BIGINT. In that case, considering that lower values compress better, we may want to design the table to seed positive values (1,1) and then manually re-seed to negative (-1,1) values when the field is close to its limit.
There are many aspects that dictate good database design. The first and foremost is to understand the data that will be stored in the database. Using the correct data types, compression and the correct identity field allocation should all be taken into account when designing the database and long term storage capacity.
This post was originally published on 18th January 2021.