How to Choose the Right ID Type for Database Entities

When designing entities in databases, one of the most critical decisions is what ID type to use. While it might not seem like a big problem in small applications, a poorly chosen ID type can lead to significant challenges as your application scales.

For instance, on X, there was this realization that UUIDs are not sortable (not v7 though). Unfortunately, by the time this issue was discovered, addressing it will require extensive work or compromises.

Important

If your application is small, with a few hundred thousand rows or less, overthinking this might be overkill. In such cases, integer keys or any familiar option should suffice, as performance differences will be marginal.

Why is the Type of ID Important?

Most ORMs and migration libraries default to integer-based IDs for database entities. While this works well for small to medium applications, the choice of ID type becomes critical in high-throughput and large-scale systems.

Non-Sortable IDs Can Cause Performance Issues

Non-sortable IDs, such as UUID (not v7), can lead to significant performance issues, especially during write-intensive operations.

For example, Shopify improved their database write performance by about 50% by switching from UUIDv4 to Universally Unique Lexicographically Sortable Identifier (ULID). Read more on the Shopify Engineering Blog.

Note

From Shopify's Blog: "In one high-throughput system at Shopify, we’ve seen a 50 percent decrease in INSERT statement duration by switching from UUIDv4 to ULID for idempotency keys."

Deterministic IDs Like Integers Can Pose Security Risks

Exposing deterministic IDs, especially integer-based IDs, can lead to:

Data Security Risks: OWASP refers to this as Insecure Direct Object References. If someone discovers an entity's ID and your application lacks proper authorization controls, they could exploit it.
Business Intelligence Risks: Exposing integer-based IDs in URLs can reveal:
- Business Size: Competitors can estimate the total number of entities in your system, such as orders, users, or transactions.
- Business Velocity: By observing changes in IDs over time, competitors can infer your growth rate. For example, if the max order ID increases from 6000 to 6010 in a week, they know you processed only 10 orders that week.

Characteristics of Good ID Types in Distributed Systems

Idempotent & Universally Unique: Timestamp-based IDs like ULID or Snowflake IDs prevent collisions and provide a true sense of uniqueness in distributed systems.
Sortable: Sortable IDs improve the performance of index updates (e.g. B-tree indexes in MySQL) and enable efficient data processing.
Reasonable Data Size: Since IDs are often indexed, their size matters. Integer IDs are smaller than string-based IDs, which is important for databases with index size limits. For example, MySQL has a hard limit on index key length:
Warning
Prefix support and lengths of prefixes (where supported) depend on the storage engine. For example, InnoDB tables using the DYNAMIC or COMPRESSED row format have a prefix length limit of 3072 bytes. Read More.
Non-Deterministic: To avoid issues like Insecure Direct Object References and exposing business intelligence.

Why Sortable IDs?

In the tweet mentioned earlier, someone asked why sortable IDs are necessary. Here are a few reasons:

Performance: Sortable IDs optimize database write efficiency by improving index updates.
Chunking Large Datasets: They enable efficient processing of large datasets by dividing them into manageable and definite sized chunks where fields like creation timestamp can't guarantee.
sql
```
SELECT `id`, `field1`, `field2`
FROM `table`
WHERE `id` >= '<offset-id>'
ORDER BY `id` ASC
LIMIT 10000
```
ID-based offset pagination: Sortable IDs allow consistent and performant pagination in large datasets when using the id as the offset field.
Better UX in Exported Data: They ensure exported data (e.g., CSV files) is ordered logically, improving readability and usability.

Examples of Good ID Types in Distributed Systems

Caution

This is not an exhaustive list. Apologies if your favorite ID type is missing.

ULID (Universally Unique Lexicographically Sortable Identifier)

128-bit identifiers encoded as 26-character strings.
High collision tolerance, with 280 possible IDs per millisecond.
Example: 01ARZ3NDEKTSV4RRFFQ69G5FAV
Read More

Snowflake IDs

Time-based unique IDs that include a timestamp, machine ID, and sequence number.
Popular in large distributed systems (e.g., Twitter).
Example: 1541815603606036480
Read More

NanoID with Timestamp

A customizable ID generator that can prepend a timestamp for sorting.
Check out this popular Node.js package.

UUIDv7

A time-sortable version of UUID with precise timestamps (up to 50 nanoseconds).
Not natively supported by most SQL databases yet, so it must be generated at the application layer.
Example: 0195d095-a57d-716f-9397-8c8d63054109
Read More

Unsigned Big Integers

Ideal for systems requiring integer IDs. Use unsigned big integers to avoid outgrowing 32-bit integer limits.
Example: 18446744073709551615 (max value for unsigned 64-bit integers).

Redemption for Integer-Based IDs

In my experience, building large data systems, I often combine integer-based IDs (for internal use) with string-based IDs (e.g., ULID or UUIDv7) for external use.

Why Not Both?

Internal IDs (Integer-Based):
- Readability: Easier to understand and debug.
- Performance: Integer comparisons are faster than string comparisons.
- Data Size: Integer keys use less memory and disk space in indexes.
External IDs (String-Based):
- Security: Mitigates risks like Insecure Direct Object References.
- Log Identification: Easier to search logs with globally unique keys.
- Globally Unique: Identifiable even within other services.

Implementation

For entities exposed externally, use both id (integer) and public_id (string):

id auto-increment primary key and the public_id can be generated through creation hooks or even by the database if supported. Example using creating hook in Laravel:

php

namespace App\Models;

use Illuminate\Database\Eloquent\Model;
use Illuminate\Support\Str;

class Entity extends Model
{
    protected static function booted(): void
    {
        parent::booted();

        static::creating(function (Model $model) {
            $model->public_id ??= Str::ulid();
        });
    }
}

Use public_id for external interactions (e.g., JSON responses, URLs) and id for internal queries, joins, and foreign keys.

Conclusion

Choosing the right ID type for your database entities is a critical decision that impacts performance, scalability, and security. By combining integer-based IDs for internal operations with string-based IDs for external interactions, you can achieve the best of both worlds: performance and security. Ultimately, the choice should align with your application's specific requirements, balancing simplicity, scalability, and future-proofing.

How to Choose the Right ID Type for Database Entities

Why is the Type of ID Important? ​

Non-Sortable IDs Can Cause Performance Issues ​

Deterministic IDs Like Integers Can Pose Security Risks ​

Characteristics of Good ID Types in Distributed Systems ​

Why Sortable IDs? ​

Examples of Good ID Types in Distributed Systems ​

ULID (Universally Unique Lexicographically Sortable Identifier) ​

Snowflake IDs ​

NanoID with Timestamp ​

UUIDv7 ​

Unsigned Big Integers ​

Redemption for Integer-Based IDs ​

Why Not Both? ​

Implementation ​

Conclusion ​

Hi 👋🏼, I am Gideon Appoh