Choose your database identifiers wisely


Posted on 2025-08-10, by Racum. Tags: Database UUID Snowflake

When modeling your database, you’ll inevitably need to identify the rows in your tables. This means picking a column to be its primary key; sounds simple enough: maybe just pick an auto-increment or UUID and go with it, right? But this subject is actually way deeper, and I hope I can give you all the tools you need to make an informed decision in this article.

Types of identifiers

Natural

Natural identifiers are those that already identify data in the β€œreal world”, for example: passport numbers, car plates, usernames, etc. They usually follow a standard or convention, and are normally alphanumeric and human-readable (but not always). Also, they almost always require explicit input, meaning they can’t be automatically filled by sequence, time, or randomness, unless your system is the one responsible for β€œemitting” them, for example: when you are building a system to emit car plates, instead of a parking system that only logs them.

Not every natural identifier can become a good primary key; some properties are required:

  • Uniqueness: the guarantee that the value won’t inadvertently identify more than one instance. For example: names (even full ones) are a bad choice to identify people, while the list of codes on ISO 3166-1 (like JPN for Japan) is a very good option for countries.
  • Universality: when it fits all cases needed for your context. Example: requiring a driver’s license number to finance a car sounds logical, but eventually will hit an exceptional case when someone without a license buys a car for someone else, or for not to be driven at all.
  • Stability: when the identifier never changes. This one is tricky: things change, even the ones thought to be stable, like the title of a movie, a city name, and even a country name could all change.

If a field can’t match those criteria above, it is not a good fit for a primary key; consider a surrogate key of another type if that happens; you can even keep presenting the imperfect natural identifier in the UI (if that won’t cause any confusion), but use another identifier internally.

A few things to keep in mind:

  • Since natural identifiers are usually alphanumeric, please consider normalizing the data, like only storing in lower or upper case, or apply/remove formatting elements, like dashes, points, etc. The idea is to pick a pattern and always follow it.
  • If changes are inevitable, consider keeping a history of all versions. But be careful, that is not always a good idea, for example: a news website forced to change the β€œslug” for not reflecting a later corrected version. But again, maybe the slug alone shouldn’t identify an article in the first place!
  • If possible, keep them short! A primary key infers that an index is needed, and with natural keys’ tendency to be alphanumeric, the index of a long text field will be less efficient than a numerical one, so go easy on the size! Plus, this can improve indexing even in our brains by making it more memorable.

Serial

This is probably the most commonly used type of identifier; most Object-Relational Mappers (ORMs) add an implicit auto-increment primary key by default. On the database side, the implementation varies, but the support is always present: some have a special field to handle that automatically, and some use sequence objects. This popularity is totally granted: serials are the most recommended type of identifier for most cases, especially on small databases when there is no need to scale horizontally.

However, when your system grows (in volume or scope), serial identifiers can be a limiting factor:

  • No scalability: you rely on a single entity being responsible for the sequencing; this both creates a β€œsingle point of failure” and disallows the system to be distributed at the database level.
  • Lack of secrecy:: this is more evident when the identifier is visible on the UI or API; it is possible to β€œleak” information about the volume of your data, for example: a competitor could see an order with the ID 1000 on your system one day, and 1100 in the next month, β€œleaking” the factor that you processed 100 orders during that time. Also, if you are not careful with your permissions, this facilitates an enumeration attack.

If the drawbacks above are not an issue now or in the foreseeable future for your specific case, by all means, use it! This is a very reliable option, all implementations of it are very solid, and it is very easy to find help if needed! But mind the field size: if you choose a 32-bit integer, you’ll be limited to a bit more than 4 billion (or 2 if signed). Ideally, pick a 64-bit field.

Time sortable

Also known as "Lexicographical" or "K-Sortable". This is when you encode the time of action (usually creation) into the identifier, making a sorting by that field reflect actions in a chronological order. They behave like "sparse" serial identifiers, with gaps between values growing with time.

The easiest (but also naive) approach is to rely only on time, like using a Unix timestamp. Unless you are 100% sure that your system can’t have nearly simultaneous creations, avoid using just time; also, mind the year 2038 problem.

When well implemented, lexicographical identifiers can be a perfect fit for a distributed system: values can be safely created simultaneously by many instances, assuming it relies on extra information, like some β€œnode” identification (process, machine, datacenter, etc.), or on randomness; Its adoption by major social media companies proves its suitability for distributed systems.

Today, the most popular implementations are UUIDv7 (128-bit) and Snowflakes (64-bit); pick one of those.

Random

Random identifiers are, well… random! They disassociate completely from the content of data and its time or sequence of events. Given enough entropy, they are very distributed-friendly, and are also very secure: unlike lexicographical keys, random ones don’t even β€œleak” their time of creation.

Although older than that, UUIDv4 became the most popular implementation of a random identifier in the last 10 years or so, gaining support in all main languages, frameworks, and databases. This popularity helped highlight its main weakness: randomness doesn’t play well with database indexes! Well, they β€œwork”, but are less performant and take more space than the index for a serial or lexicographical field would.

If the 128-bit of a UUIDv4 is too big for your case, there are options: for example, YouTube IDs are just a random 64-bit integer, encoded in Base64, and checked if a previous value is already in use during its time of creation; this amazing video by Tom Scott shows this technique quite well.

Overall, only use random keys if you really know what you are doing and your requirements only allow for this option; otherwise, pick a lexicographical identifier instead.

Content derived

This is when we identify the data by the hash of its content; the end result looks similar to random, but is actually deterministic, granting it some interesting properties.

Probably the most familiar case of this type developers encounter daily are git commit IDs, that are just the result of the SHA-1 function (or SHA-256 as an option) being applied over the changes and its metadata.

This is a very advanced technique, not recommended for beginners; but if you are confident in your skills, identifying by hash can give a boost of performance when dealing with value types, like complex objects that happen to be immutable, thus trivially cacheable. Also, be mindful of the space needed: good hashes take a lot of space; for example, the default SHA-1 used by git takes 160 bits, and this function by itself is already considered deprecated; newer variations of SHA-2 and SHA-3 can even reach 512 bits.

Hierarchical

Also known as "Composite" or "Namespaced". This is an umbrella-term to classify multi-level identifiers combining the types described above.

Notable examples are:

  • Jira-style: usually natural+serial, example: LOG-1234:
    • LOG: logistics department,
    • 1234: ticket number inside de department.
  • ISBN-13: list of serials, used for books, example: 978-0-553-38257-0:
    • 978-0: language "english",
    • 553: publisher "Random House Inc.",
    • 38257: title, in this case, "Foundation", from Isaac Asimov,
    • 0: checksum.
  • UPC code, list of serials, used for supermarked goods (barcode), example: 7622201024277:
    • 76: region "Switzerland",
    • 2220: manufacturer "Milka",
    • 102427: product "milk chocolate bar, 150g",
    • 7: checksum.

Remember that this is context-based: the same code can be hierarchical for the system creating it, and natural for another one referencing it.

Considerations

Security

Most of the identifier types leak some data; usually this is a tolerable outcome, offset by other advantages, but if you are in a situation that you really need to audit what gets exposed, here is a list:

  • Natural: no effort to hide anything.
  • Serial: leaks quantity and progression.
  • Time sortable: leaks time.
  • Random: fully opaque.
  • Content derived: fully opaque.
  • Hierarchical: depends on the type of its parts.

Performance

If performance is your priority, here is a list of considerations by type:

  • Natural: depends, usually poor indexing.
  • Serial: very good indexing.
  • Time sortable: very good indexing, sharding by time is possible.
  • Random: very bad indexing.
  • Content derived: very bad indexing, heavy hash operation on creation, but facilitates caching.
  • Hierarchical: depends, usually good indexing, sharding by level is possible.

The reason for serial and time sortable having good indexing is that databases rely on a B-Tree structure, that works quite well with increasing numerical values (that get just appended to the end), instead of "jumping around" to add random or hashed values.

Signed / unsigned integers

This is important to check for serial and time sortable types: most databases use signed integers as their native option for those keys, and since they only use positive numbers, that means half of possible serial values, and a drastically shorter time limit for lexicographical ones. The support for unsigned values can be emulated with a β€œdecimal” type in most cases, but you could be entering into β€œfloat” territory if you are not careful. The lack of support for unsigned integers doesn’t necessarily need to block a database utilization, just keep in mind that the limits of identifier you picked can be lower than expected.

This is not an issue for UUIDs (internally unsigned 128-bit integers); this is so common that all popular databases have a special field to handle them.

Pro Tips

  • Look for modern alternatives for legacy fields and operations, for example: PostgreSQL created Identity Columns as an alternative to Serial Types. Those usually come with performance improvements.
  • If possible, test how performant an implementation is with real data before committing to it.

Implementations

UUID

UUIDs, or "Universally Unique Identifiers", are 128-bit unsigned integers, standardized in the RFC-4122 (versions 1 to 5) and extended in the RFC-9562 (versions 6 to 8). They are usually displayed as 32 hexadecimal characters formatted in 8-4-4-4-12 segments, like this: 00000000-0000-0000-0000-000000000000, but sometimes can appear without the dashes.

Their bits follow this layout:

 Bit Layout:
┏━━ Hexa: ━━┯━━━━━━━━━━━━━━━━━━━ Bits: ━━━━━━━━━━━━━━━━━━━┓
┃ .... .... β”‚ .... ....  .... ....   .... ....  .... .... ┃
┃ .... .... β”‚ .... ....  .... ....   vvvv ....  .... .... ┃
┃ .... .... β”‚ tt.. ....  .... ....   .... ....  .... .... ┃
┃ .... .... β”‚ .... ....  .... ....   .... ....  .... .... ┃
┗━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
 Type: tt, version: vvvv.

 Example for 0198937f-2c11-7264-a318-02dfcb75944e: 
┏━━ Hexa: ━━┯━━━━━━━━━━━━━━━━━━━ Bits: ━━━━━━━━━━━━━━━━━━━┓
┃ 0198 937f β”‚ 0000 0001  1001 1000   1001 0011  0111 1111 ┃
┃ 2c11 7264 β”‚ 0010 1100  0001 0001  β†’0111 0010  0110 0100 ┃
┃ a318 02df β”‚β†’1010 0011  0001 1000   0000 0010  1101 1111 ┃
┃ cb75 944e β”‚ 1100 1011  0111 0101   1001 0100  0100 1110 ┃
┗━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
 Type: 10, verion: 0111 (7).

The bits for the types we want are always 10; if they are different than that, they fall outside the scope of the 2 RFCs, like the custom ones used by NCS and Microsoft (they call it a "GUID").

Versions:

  1. This is a train wreck! It is technically time-based, but not lexicographical.
  2. Used for DCE Security, proprietary, out of scope of RFC-4122.
  3. MD5 hash.
  4. Random. ⭐️
  5. SHA-1 hash.
  6. Truly lexicographical (timestamp + node).
  7. Truly lexicographical (timestamp + random). ⭐️
  8. Open to custom implementations.

I’m only explaining the versions for completion. The only versions you should actually care about are UUIDv4 (random) and UUIDv7 (lexicographical). If you are not sure what to use, default to UUIDv7.

UUID-compatible

Since UUIDs became popular and the databases created a special field to hold them, some implementations took advantage of that to create similar formats that could be encoded as the same unsigned 128-bit integer:

FormatCommentExample
ULIDThis became popular for a while before UUIDv7 got ratified, it fits the same need of timestamp+random. It is displayed as 26 characters encoded in Crockford’s Base32.01JCXSGZMZQQJ2M93WC0T8KT02
UPIDThis is a very interesting hybrid: it starts as timestamp+random, but ends with some bits dedicated to create a 4-letter word used as a namespace, making it both lexicographical and hierarchical! Like ULID, it is also displayed encoding in Crockford’s Base32 but with a twist: the namespace appears at the beginning.user_2accvpp5guht4dts56je5a
SCRU128Exactly the same layout as ULID (timestamp+random), but displayed in Base36.03cwivkme1qhj3crprqujv4lu
TimeflakeAnother ULID clone, but encoded in Base62.02i2XhN7hAuaFh3MwztcMd
TypeIDUsed by Jetify; it is another ULID clone, but it is displayed in Base32 with a prefix that is not stored in the 128 bits; it needs to be stored in another column to recreate the rendered version.prefix_01h2xcejqtf2nbrexx3vqjhp41
but prefix_ is not encoded.
FlakeUsed by BMC TrueSight Pulse (formerly Boundary); It is lexicographical (timestamp+node+sequence), and gets displayed in Base62.8HFaR8qWtRlGDHnO57
Datadog Trace IDThis is basically the Datadog developers using only the bits they need from the 128 available: it starts with 16 bits of timestamp, followed by 16 bits of zeroes, and finally 32 bits of randomness. It just gets displayed in hex, like normal UUIDs.6772800700000000d97a8af26532e259

Other than those, there are many projects with the same goal of rendering normal UUIDs with fewer characters. They all have their own combinations of encoding, alphabets, and rules. Looks like everyone is "rolling your own"!

Snowflake variants

This is a Lexicographical 64-bit integer open-sourced by Twitter (now X) in 2010; since then, many other companies and open projects started their own variations based on it.

It follows this structure:

  • Timestamp: to build it, you'll need:
    • Bit size: usually around 42 bits, but varies by version.
    • Time "tick": usually measured in 1 millisecond, but other units could be used, like 10ms for Sony and 1s for Frostflake.
    • Epoch: the Unix timestamp of the zero value. Normally each implementation picks a round-ish time near the beginning of their system; the Unix timestamp itself (1970-01-01) can sometimes be used, but that means wasting 55+ years that could limit the lifetime of the ID.
    • Zero bit (optional): some databases (like PostgreSQL and SQLite) only support signed 64-bit integers, that means reserving the first digit the sign, making only 63 bits usable for the rest of the Snowflake; that's why some reserve the first bit as always zero.
  • Extra fields: usually has those elements (order and existence vary):
    • Some kind of segmentation: this is how distributed systems avoid collisions for things that happen at different places in the exact timestamp. Names like "worker", "node", "generator" and "datacenter" are common.
    • Sequence: another level of collision avoidance; if things happen fast enough in the same place during the same time "tick", this values gets incremented.
    • Random: this is rare, the only example I could find of this was the TSID implemmentation.

Here are a few examples of how the variations layout their bits:

Twitter:
0 00000000000000000000000000000000000000000 0000000000 000000000000
x Timestamp ticks                           Worker     Sequence

Frostflake:
00000000000000000000000000000000 000000000000000000000 00000000000
Timestamp ticks                  Sequence              Generator

TSID:
000000000000000000000000000000000000000000 0000000000000000000000
Timestamp ticks                            Random

Implementation example:

Assuming a Twitter ID like 1777150623882019211:

In binary: 0 00110001010100110110101000100111010000011 0110100000 000110001011
           | |                                         |          |
           | |                                         |          Sequence: 395
           | |                                         Worker: 416
           | Timestamp ticks: 423705726595
           Always zero

Timestamp: (423705726595 * 1) + 1288834974657 = 1712540701252
            |              |    |               |
            |              |    |               Time: 2024-04-08T01:45:01.252Z
            |              |    Epoch
            |              Tick duration: 1ms
            Ticks since epoch

I was able to find all this versions "in the wild":

VariantExpirationUnix epoch (ms)Time
"tick"
Time
bits
Extra's bits
Twitter2080-07-1012888349746571ms1 + 41Worker: 10
Sequence: 12
Discord2154-05-15
2084-09-06 (signed)
14200704000001ms42Worker: 5
Process: 5
Sequence: 12
Mastodon10889-08-02 (!!!)
6429-10-17 (signed)
01ms48Sequence: 16
Meta
(Instagram and Threads)
2081-04-30
2046-06-27 (signed)
13142200217211ms41Shard: 13
Sequence: 10
LinkedIn2039-09-0701ms1 + 41Worker: 10
Sequence: 12
Sony2188-11-16140952960000010ms1 + 39Sequence: 8
Machine: 8
Spaceflake2084-09-0614200704000001ms1 + 41Node: 5
Worker: 5
Sequence: 12
Frostflake2106-02-07
2038-01-19 (signed)
01s32Sequence: 21
Generator: 11
Flake ID2109-05-15
2039-09-07 (signed)
01ms42Datacenter: 5
Worker: 5
Ssequence: 12
TID
(AT Protocol, Bluesky)
2255-06-0501s1 + 53Clock: 10
SCRU644261-02-270256ms2 + 38Node: 24
TSID2159-05-15
2089-09-06 (signed)
15778368000001ms42Random: 22

Some variations have the option of an alternative non-numerical representation, but those are just the same 64-bit integers encoded in various forms, usually to look short on URLs:

VariationCharactersEncodingExample
TID (Bluesky)Fixed 13Base323lfegaoywdk2w
Threads (Meta)Up to 11Base64DEr_fXvuw6D
FrostflakeUp to 11Base58JERHwh5PXjL
SCRU64Fixed 12Base360v20wcjrb21p
TSIDFixed 13Base32 (Crockford's)0J4AEXRN106Z0

Recommendations:

  • To make things simpler, I recommend the TSID approach (timestamp + random), this is kinda of a mini version of an UUIDv7.
  • If you need to have a note + sequence, just pick what is available for your language, or just "roll your own" (it is not that hard).
  • Whatever the pattern, try to pick an epoch that fits your case.
  • Mind your database numeric type, if it only allows for signed integers, fine tune the bits and "tick" to make it last.
  • Never change your implementation once picked.

Other

UUIDs and Snowflakes are not silver bullets; there will always be cases when you need to create your own type of identifier to fit your exclusive needs. Here is a list of formats like that, where the creators were kind enough to share their implementations:

FormatTypeBacked byEncodingExample
XidTime sortableInteger: 96 bitsBase32hexcst4p962941gd9baqg70
ObjectId (MongoDB)Time sortableInteger: 96 bitsHex6592008029c8c3e4dc76256c
KSUIDTime sortableInteger: 160 bitsBase641HCpXwx2EK9oYluWbacgeCnFcLf
CUID1Time sortableASCII: 25 charsBase36cm3xemk9o00070cm7ghnl6toe
PuidTime sortableASCII: 12, 14 or 24 charsBase36he5fps6l2504cd1w3ag8ut8e
PushID (Firebase)Time sortableASCII: 20 charsBase64-OFrJ24CPTXLcIPPjvh3
Sqids / Hashids (old)Hierarchical: list of integersASCII: variableCustom86Rf07xd4z / 80JTEquWr
Stripe IDHierarchical: natural + randomASCII: variableCustromcus_lO1DEQWBbQAACfHO
NUID (NATS)Hierarchical: random + serialASCII: 22 charsBase62EQyuCsA4ysv7ezXReOrk4i
CUID2RandomASCII: 24 charsBase36byab6ewccgwheoshq1wk9hds
Nano IDRandomASCII: usually 21 charsCustomXBCdxzsCR2FEFeSwhnjCo
Breeze IDRandomASCII: 19 charsCustom9NU6-XQLZ-BDIH-6HKE
AlphadecTime sortableTimeCustom2025_L0V3_000000

Conclusion

As you can see, this subject is very deceiving: it looks simple and well-established, but is actually very deep, with lots of nuance to consider! But, if you want a very short answer, here is the abridged version:

  • If you have a candidate for a natural key, make sure it is unique, universal and stable; if you are not sure, pick other type.
  • If you are sure your system will never have a need to be distributed, pick a serial; they are basically β€œset and forget”.
  • If you know or suspect your system may need to be distributed, pick one of the time sortable options. UUIDv7 is the most straightforward one, but, if you trust your skills and want to save some space, pick one of the Snowflake variants or use them as a basis for a custom variant of your own.
  • If you are in a situation that you can’t even leak the creation time of your rows (be reasonable, not paranoid), then pick a random identifier, like UUIDv4.
  • If you have a case of complex immutable objects, consider the content derived (hash) approach.
  • If you have to integrate many systems or are creating an industry standard: first: try not to reinvent the wheel, look for existing options; if you still need a custom case, consider creating a hierarchical identifier.

I hope this article has been useful. This kind of topic is usually not deeply discussed academically, and in the industry it is common to just go for the familiar patterns. Now you can make a thoughtful and considerate decision next time you model your data.