Choose your database identifiers wisely

Posted on 2025-08-10, by Racum. Tags: Database UUID Snowflake

Table of Contents

When modeling your database, you’ll inevitably need to identify the rows in your tables. This means picking a column to be its primary key; sounds simple enough: maybe just pick an auto-increment or UUID and go with it, right? But this subject is actually way deeper, and I hope I can give you all the tools you need to make an informed decision in this article.

Types of identifiers

Natural

Natural identifiers are those that already identify data in the “real world”, for example: passport numbers, car plates, usernames, etc. They usually follow a standard or convention, and are normally alphanumeric and human-readable (but not always). Also, they almost always require explicit input, meaning they can’t be automatically filled by sequence, time, or randomness, unless your system is the one responsible for “emitting” them, for example: when you are building a system to emit car plates, instead of a parking system that only logs them.

Not every natural identifier can become a good primary key; some properties are required:

Uniqueness: the guarantee that the value won’t inadvertently identify more than one instance. For example: names (even full ones) are a bad choice to identify people, while the list of codes on ISO 3166-1 (like JPN for Japan) is a very good option for countries.
Universality: when it fits all cases needed for your context. Example: requiring a driver’s license number to finance a car sounds logical, but eventually will hit an exceptional case when someone without a license buys a car for someone else, or for not to be driven at all.
Stability: when the identifier never changes. This one is tricky: things change, even the ones thought to be stable, like the title of a movie, a city name, and even a country name could all change.

If a field can’t match those criteria above, it is not a good fit for a primary key; consider a surrogate key of another type if that happens; you can even keep presenting the imperfect natural identifier in the UI (if that won’t cause any confusion), but use another identifier internally.

A few things to keep in mind:

Since natural identifiers are usually alphanumeric, please consider normalizing the data, like only storing in lower or upper case, or apply/remove formatting elements, like dashes, points, etc. The idea is to pick a pattern and always follow it.
If changes are inevitable, consider keeping a history of all versions. But be careful, that is not always a good idea, for example: a news website forced to change the “slug” for not reflecting a later corrected version. But again, maybe the slug alone shouldn’t identify an article in the first place!
If possible, keep them short! A primary key infers that an index is needed, and with natural keys’ tendency to be alphanumeric, the index of a long text field will be less efficient than a numerical one, so go easy on the size! Plus, this can improve indexing even in our brains by making it more memorable.

Serial

This is probably the most commonly used type of identifier; most Object-Relational Mappers (ORMs) add an implicit auto-increment primary key by default. On the database side, the implementation varies, but the support is always present: some have a special field to handle that automatically, and some use sequence objects. This popularity is totally granted: serials are the most recommended type of identifier for most cases, especially on small databases when there is no need to scale horizontally.

However, when your system grows (in volume or scope), serial identifiers can be a limiting factor:

No scalability: you rely on a single entity being responsible for the sequencing; this both creates a “single point of failure” and disallows the system to be distributed at the database level.
Lack of secrecy:: this is more evident when the identifier is visible on the UI or API; it is possible to “leak” information about the volume of your data, for example: a competitor could see an order with the ID 1000 on your system one day, and 1100 in the next month, “leaking” the factor that you processed 100 orders during that time. Also, if you are not careful with your permissions, this facilitates an enumeration attack.

If the drawbacks above are not an issue now or in the foreseeable future for your specific case, by all means, use it! This is a very reliable option, all implementations of it are very solid, and it is very easy to find help if needed! But mind the field size: if you choose a 32-bit integer, you’ll be limited to a bit more than 4 billion (or 2 if signed). Ideally, pick a 64-bit field.

Time sortable

Also known as "Lexicographical" or "K-Sortable". This is when you encode the time of action (usually creation) into the identifier, making a sorting by that field reflect actions in a chronological order. They behave like "sparse" serial identifiers, with gaps between values growing with time.

The easiest (but also naive) approach is to rely only on time, like using a Unix timestamp. Unless you are 100% sure that your system can’t have nearly simultaneous creations, avoid using just time; also, mind the year 2038 problem.

When well implemented, lexicographical identifiers can be a perfect fit for a distributed system: values can be safely created simultaneously by many instances, assuming it relies on extra information, like some “node” identification (process, machine, datacenter, etc.), or on randomness; Its adoption by major social media companies proves its suitability for distributed systems.

Today, the most popular implementations are UUIDv7 (128-bit) and Snowflakes (64-bit); pick one of those.

Random

Random identifiers are, well… random! They disassociate completely from the content of data and its time or sequence of events. Given enough entropy, they are very distributed-friendly, and are also very secure: unlike lexicographical keys, random ones don’t even “leak” their time of creation.

Although older than that, UUIDv4 became the most popular implementation of a random identifier in the last 10 years or so, gaining support in all main languages, frameworks, and databases. This popularity helped highlight its main weakness: randomness doesn’t play well with database indexes! Well, they “work”, but are less performant and take more space than the index for a serial or lexicographical field would.

If the 128-bit of a UUIDv4 is too big for your case, there are options: for example, YouTube IDs are just a random 64-bit integer, encoded in Base64, and checked if a previous value is already in use during its time of creation; this amazing video by Tom Scott shows this technique quite well.

Overall, only use random keys if you really know what you are doing and your requirements only allow for this option; otherwise, pick a lexicographical identifier instead.

Content derived

This is when we identify the data by the hash of its content; the end result looks similar to random, but is actually deterministic, granting it some interesting properties.

Probably the most familiar case of this type developers encounter daily are git commit IDs, that are just the result of the SHA-1 function (or SHA-256 as an option) being applied over the changes and its metadata.

This is a very advanced technique, not recommended for beginners; but if you are confident in your skills, identifying by hash can give a boost of performance when dealing with value types, like complex objects that happen to be immutable, thus trivially cacheable. Also, be mindful of the space needed: good hashes take a lot of space; for example, the default SHA-1 used by git takes 160 bits, and this function by itself is already considered deprecated; newer variations of SHA-2 and SHA-3 can even reach 512 bits.

Hierarchical

Also known as "Composite" or "Namespaced". This is an umbrella-term to classify multi-level identifiers combining the types described above.

Notable examples are:

Jira-style: usually natural+serial, example: LOG-1234:
- LOG: logistics department,
- 1234: ticket number inside de department.
ISBN-13: list of serials, used for books, example: 978-0-553-38257-0:
- 978-0: language "english",
- 553: publisher "Random House Inc.",
- 38257: title, in this case, "Foundation", from Isaac Asimov,
- 0: checksum.
UPC code, list of serials, used for supermarked goods (barcode), example: 7622201024277:
- 76: region "Switzerland",
- 2220: manufacturer "Milka",
- 102427: product "milk chocolate bar, 150g",
- 7: checksum.

Remember that this is context-based: the same code can be hierarchical for the system creating it, and natural for another one referencing it.

Considerations

Security

Most of the identifier types leak some data; usually this is a tolerable outcome, offset by other advantages, but if you are in a situation that you really need to audit what gets exposed, here is a list:

Natural: no effort to hide anything.
Serial: leaks quantity and progression.
Time sortable: leaks time.
Random: fully opaque.
Content derived: fully opaque.
Hierarchical: depends on the type of its parts.

Performance

If performance is your priority, here is a list of considerations by type:

Natural: depends, usually poor indexing.
Serial: very good indexing.
Time sortable: very good indexing, sharding by time is possible.
Random: very bad indexing.
Content derived: very bad indexing, heavy hash operation on creation, but facilitates caching.
Hierarchical: depends, usually good indexing, sharding by level is possible.

The reason for serial and time sortable having good indexing is that databases rely on a B-Tree structure, that works quite well with increasing numerical values (that get just appended to the end), instead of "jumping around" to add random or hashed values.

Signed / unsigned integers

This is important to check for serial and time sortable types: most databases use signed integers as their native option for those keys, and since they only use positive numbers, that means half of possible serial values, and a drastically shorter time limit for lexicographical ones. The support for unsigned values can be emulated with a “decimal” type in most cases, but you could be entering into “float” territory if you are not careful. The lack of support for unsigned integers doesn’t necessarily need to block a database utilization, just keep in mind that the limits of identifier you picked can be lower than expected.

This is not an issue for UUIDs (internally unsigned 128-bit integers); this is so common that all popular databases have a special field to handle them.

Pro Tips

Look for modern alternatives for legacy fields and operations, for example: PostgreSQL created Identity Columns as an alternative to Serial Types. Those usually come with performance improvements.
If possible, test how performant an implementation is with real data before committing to it.

Implementations

UUID

UUIDs, or "Universally Unique Identifiers", are 128-bit unsigned integers, standardized in the RFC-4122 (versions 1 to 5) and extended in the RFC-9562 (versions 6 to 8). They are usually displayed as 32 hexadecimal characters formatted in 8-4-4-4-12 segments, like this: 00000000-0000-0000-0000-000000000000, but sometimes can appear without the dashes.

Their bits follow this layout:

 Bit Layout:
┏━━ Hexa: ━━┯━━━━━━━━━━━━━━━━━━━ Bits: ━━━━━━━━━━━━━━━━━━━┓
┃ .... .... │ .... ....  .... ....   .... ....  .... .... ┃
┃ .... .... │ .... ....  .... ....   vvvv ....  .... .... ┃
┃ .... .... │ tt.. ....  .... ....   .... ....  .... .... ┃
┃ .... .... │ .... ....  .... ....   .... ....  .... .... ┃
┗━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
 Type: tt, version: vvvv.

 Example for 0198937f-2c11-7264-a318-02dfcb75944e: 
┏━━ Hexa: ━━┯━━━━━━━━━━━━━━━━━━━ Bits: ━━━━━━━━━━━━━━━━━━━┓
┃ 0198 937f │ 0000 0001  1001 1000   1001 0011  0111 1111 ┃
┃ 2c11 7264 │ 0010 1100  0001 0001  →0111 0010  0110 0100 ┃
┃ a318 02df │→1010 0011  0001 1000   0000 0010  1101 1111 ┃
┃ cb75 944e │ 1100 1011  0111 0101   1001 0100  0100 1110 ┃
┗━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
 Type: 10, verion: 0111 (7).

The bits for the types we want are always 10; if they are different than that, they fall outside the scope of the 2 RFCs, like the custom ones used by NCS and Microsoft (they call it a "GUID").

Versions:

This is a train wreck! It is technically time-based, but not lexicographical.
Used for DCE Security, proprietary, out of scope of RFC-4122.
MD5 hash.
Random. ⭐️
SHA-1 hash.
Truly lexicographical (timestamp + node).
Truly lexicographical (timestamp + random). ⭐️
Open to custom implementations.

I’m only explaining the versions for completion. The only versions you should actually care about are UUIDv4 (random) and UUIDv7 (lexicographical). If you are not sure what to use, default to UUIDv7.

UUID-compatible

Since UUIDs became popular and the databases created a special field to hold them, some implementations took advantage of that to create similar formats that could be encoded as the same unsigned 128-bit integer:

Format	Comment	Example
ULID	This became popular for a while before UUIDv7 got ratified, it fits the same need of timestamp+random. It is displayed as 26 characters encoded in Crockford’s Base32.	`01JCXSGZMZQQJ2M93WC0T8KT02`
Julid	Based on ULID, but set as timestamp+sequence+random, with 16 of the ULID’s original 80 random bits used for sequence.	`01K3ESSGBY0002QCB9YXT6Q6MN`
UPID	This is a very interesting hybrid: it starts as timestamp+random, but ends with some bits dedicated to create a 4-letter word used as a namespace, making it both lexicographical and hierarchical! Like ULID, it is also displayed encoding in Crockford’s Base32 but with a twist: the namespace appears at the beginning.	`user_2accvpp5guht4dts56je5a`
SCRU128	Exactly the same layout as ULID (timestamp+random), but displayed in Base36.	`03cwivkme1qhj3crprqujv4lu`
Timeflake	Another ULID clone, but encoded in Base62.	`02i2XhN7hAuaFh3MwztcMd`
Spotify ID	Encoded in Base62, but with all 128 bits random (no type or version bits).	`4PTG3Z6ehGkBFwjybzWkR8`
TypeID	Used by Jetify; it is another ULID clone, but it is displayed in Base32 with a prefix that is not stored in the 128 bits; it needs to be stored in another column to recreate the rendered version.	`prefix_01h2xcejqtf2nbrexx3vqjhp41` but `prefix_` is not encoded.
Flake	Used by BMC TrueSight Pulse (formerly Boundary); It is lexicographical (timestamp+node+sequence), and gets displayed in Base62.	`8HFaR8qWtRlGDHnO57`
Datadog Trace ID	This is basically the Datadog developers using only the bits they need from the 128 available: it starts with 16 bits of timestamp, followed by 16 bits of zeroes, and finally 32 bits of randomness. It just gets displayed in hex, like normal UUIDs.	`6772800700000000d97a8af26532e259`

Other than those, there are many projects with the same goal of rendering normal UUIDs with fewer characters. They all have their own combinations of encoding, alphabets, and rules. Looks like everyone is "rolling your own"!

Snowflake variants

This is a Lexicographical 64-bit integer open-sourced by Twitter (now X) in 2010; since then, many other companies and open projects started their own variations based on it.

It follows this structure:

Timestamp: to build it, you'll need:
- Bit size: usually around 42 bits, but varies by version.
- Time "tick": usually measured in 1 millisecond, but other units could be used, like 10ms for Sony and 1s for Frostflake.
- Epoch: the Unix timestamp of the zero value. Normally each implementation picks a round-ish time near the beginning of their system; the Unix timestamp itself (1970-01-01) can sometimes be used, but that means wasting 55+ years that could limit the lifetime of the ID.
- Zero bit (optional): some databases (like PostgreSQL and SQLite) only support signed 64-bit integers, that means reserving the first digit the sign, making only 63 bits usable for the rest of the Snowflake; that's why some reserve the first bit as always zero.
Extra fields: usually has those elements (order and existence vary):
- Some kind of segmentation: this is how distributed systems avoid collisions for things that happen at different places in the exact timestamp. Names like "worker", "node", "generator" and "datacenter" are common.
- Sequence: another level of collision avoidance; if things happen fast enough in the same place during the same time "tick", this values gets incremented.
- Random: this is rare, the only example I could find of this was the TSID implemmentation.

Here are a few examples of how the variations layout their bits:

Twitter:
0 00000000000000000000000000000000000000000 0000000000 000000000000
x Timestamp ticks                           Worker     Sequence

Frostflake:
00000000000000000000000000000000 000000000000000000000 00000000000
Timestamp ticks                  Sequence              Generator

TSID:
000000000000000000000000000000000000000000 0000000000000000000000
Timestamp ticks                            Random

Implementation example:

Assuming a Twitter ID like 1777150623882019211:

In binary: 0 00110001010100110110101000100111010000011 0110100000 000110001011
           | |                                         |          |
           | |                                         |          Sequence: 395
           | |                                         Worker: 416
           | Timestamp ticks: 423705726595
           Always zero

Timestamp: (423705726595 * 1) + 1288834974657 = 1712540701252
            |              |    |               |
            |              |    |               Time: 2024-04-08T01:45:01.252Z
            |              |    Epoch
            |              Tick duration: 1ms
            Ticks since epoch

I was able to find all this versions "in the wild":

Variant	Expiration	Unix epoch (ms)	Time "tick"	Time bits	Extra's bits
Twitter	`2080-07-10`	1288834974657	1ms	1 + 41	Worker: 10 Sequence: 12
Discord	`2154-05-15` `2084-09-06` (signed)	1420070400000	1ms	42	Worker: 5 Process: 5 Sequence: 12
Mastodon	`10889-08-02` (!!!) `6429-10-17` (signed)	0	1ms	48	Sequence: 16
Meta (Instagram and Threads)	`2081-04-30` `2046-06-27` (signed)	1314220021721	1ms	41	Shard: 13 Sequence: 10
LinkedIn	`2039-09-07`	0	1ms	1 + 41	Worker: 10 Sequence: 12
Sony	`2188-11-16`	1409529600000	10ms	1 + 39	Sequence: 8 Machine: 8
Spaceflake	`2084-09-06`	1420070400000	1ms	1 + 41	Node: 5 Worker: 5 Sequence: 12
Frostflake	`2106-02-07` `2038-01-19` (signed)	0	1s	32	Sequence: 21 Generator: 11
Flake ID	`2109-05-15` `2039-09-07` (signed)	0	1ms	42	Datacenter: 5 Worker: 5 Ssequence: 12
TID (AT Protocol, Bluesky)	`2255-06-05`	0	1s	1 + 53	Clock: 10
SCRU64	`4261-02-27`	0	256ms	2 + 38	Node: 24
TSID	`2159-05-15` `2089-09-06` (signed)	1577836800000	1ms	42	Random: 22
SnowID	`2163-05-15` `2093-09-06` (signed)	1704067200000	1ms	42	Node: 10 Sequence: 12
Nano64	`2527-06-23` `2248-09-26` (signed)	0	1ms	44	Random: 20

Some variations have the option of an alternative non-numerical representation, but those are just the same 64-bit integers encoded in various forms, usually to look short on URLs:

Variation	Characters	Encoding	Example
TID (Bluesky)	Fixed 13	Base32	`3lfegaoywdk2w`
Threads (Meta)	Up to 11	Base64	`DEr_fXvuw6D`
Frostflake	Up to 11	Base58	`JERHwh5PXjL`
SCRU64	Fixed 12	Base36	`0v20wcjrb21p`
TSID	Fixed 13	Base32 (Crockford's)	`0J4AEXRN106Z0`
SnowID	Up to 11	Base62	`HYOYoYloLw`
Nano64	16 or 17	Hex	`199C01B6659-5861C`

Recommendations:

To make things simpler, I recommend the TSID approach (timestamp + random), this is kinda of a mini version of an UUIDv7.
If you need to have a note + sequence, just pick what is available for your language, or just "roll your own" (it is not that hard).
Whatever the pattern, try to pick an epoch that fits your case.
Mind your database numeric type, if it only allows for signed integers, fine tune the bits and "tick" to make it last.
Never change your implementation once picked.

Other

UUIDs and Snowflakes are not silver bullets; there will always be cases when you need to create your own type of identifier to fit your exclusive needs. Here is a list of formats like that, where the creators were kind enough to share their implementations:

Format	Type	Backed by	Encoding	Example
Xid	Time sortable	Integer: 96 bits	Base32hex	`cst4p962941gd9baqg70`
ObjectId (MongoDB)	Time sortable	Integer: 96 bits	Hex	`6592008029c8c3e4dc76256c`
KSUID	Time sortable	Integer: 160 bits	Base64	`1HCpXwx2EK9oYluWbacgeCnFcLf`
CUID1	Time sortable	ASCII: 25 chars	Base36	`cm3xemk9o00070cm7ghnl6toe`
Puid	Time sortable	ASCII: 12, 14 or 24 chars	Base36	`he5fps6l2504cd1w3ag8ut8e`
PushID (Firebase)	Time sortable	ASCII: 20 chars	Base64	`-OFrJ24CPTXLcIPPjvh3`
Alphadec	Time sortable	Time	Custom	`2025_L0V3_000000`
Sqids / Hashids (old)	Hierarchical: list of integers	ASCII: variable	Custom	`86Rf07xd4z` / `80JTEquWr`
Stripe ID	Hierarchical: natural + random	ASCII: variable	Custom	`cus_lO1DEQWBbQAACfHO`
Slack ID	Hierarchical: natural + random	ASCII: variable	Base36	`C12345ABCDE`
NUID (NATS)	Hierarchical: random + serial	ASCII: 22 chars	Base62	`EQyuCsA4ysv7ezXReOrk4i`
CUID2	Random	ASCII: 24 chars	Base36	`byab6ewccgwheoshq1wk9hds`
Nano ID	Random	ASCII: usually 21 chars	Custom	`XBCdxzsCR2FEFeSwhnjCo`
Breeze ID	Random	ASCII: 19 chars	Custom	`9NU6-XQLZ-BDIH-6HKE`
DUNS Number	Random	Integer: 9 digits	Numeric	`15-048-3782`
Google Docs ID	Random	ASCII: 44 chars	Base64	`1ZWZmZWN0dGhyZW...` (hover)
ASIN (Amazon)	Serial	ASCII: 10 chars	Base36	`B00DQC2FPM`

Conclusion

As you can see, this subject is very deceiving: it looks simple and well-established, but is actually very deep, with lots of nuance to consider! But, if you want a very short answer, here is the abridged version:

If you have a candidate for a natural key, make sure it is unique, universal and stable; if you are not sure, pick other type.
If you are sure your system will never have a need to be distributed, pick a serial; they are basically “set and forget”.
If you know or suspect your system may need to be distributed, pick one of the time sortable options. UUIDv7 is the most straightforward one, but, if you trust your skills and want to save some space, pick one of the Snowflake variants or use them as a basis for a custom variant of your own.
If you are in a situation that you can’t even leak the creation time of your rows (be reasonable, not paranoid), then pick a random identifier, like UUIDv4.
If you have a case of complex immutable objects, consider the content derived (hash) approach.
If you have to integrate many systems or are creating an industry standard: first: try not to reinvent the wheel, look for existing options; if you still need a custom case, consider creating a hierarchical identifier.

I hope this article has been useful. This kind of topic is usually not deeply discussed academically, and in the industry it is common to just go for the familiar patterns. Now you can make a thoughtful and considerate decision next time you model your data.