Choose your database identifiers wisely
Posted on 2025-08-10, by Racum.
When modeling your database, youβll inevitably need to identify the rows in your tables. This means picking a column to be its primary key; sounds simple enough: maybe just pick an auto-increment or UUID and go with it, right? But this subject is actually way deeper, and I hope I can give you all the tools you need to make an informed decision in this article.
Types of identifiers
Natural
Natural identifiers are those that already identify data in the βreal worldβ, for example: passport numbers, car plates, usernames, etc. They usually follow a standard or convention, and are normally alphanumeric and human-readable (but not always). Also, they almost always require explicit input, meaning they canβt be automatically filled by sequence, time, or randomness, unless your system is the one responsible for βemittingβ them, for example: when you are building a system to emit car plates, instead of a parking system that only logs them.
Not every natural identifier can become a good primary key; some properties are required:
- Uniqueness: the guarantee that the value wonβt inadvertently identify more than one instance. For example: names (even full ones) are a bad choice to identify people, while the list of codes on ISO 3166-1 (like
JPN
for Japan) is a very good option for countries. - Universality: when it fits all cases needed for your context. Example: requiring a driverβs license number to finance a car sounds logical, but eventually will hit an exceptional case when someone without a license buys a car for someone else, or for not to be driven at all.
- Stability: when the identifier never changes. This one is tricky: things change, even the ones thought to be stable, like the title of a movie, a city name, and even a country name could all change.
If a field canβt match those criteria above, it is not a good fit for a primary key; consider a surrogate key of another type if that happens; you can even keep presenting the imperfect natural identifier in the UI (if that wonβt cause any confusion), but use another identifier internally.
A few things to keep in mind:
- Since natural identifiers are usually alphanumeric, please consider normalizing the data, like only storing in lower or upper case, or apply/remove formatting elements, like dashes, points, etc. The idea is to pick a pattern and always follow it.
- If changes are inevitable, consider keeping a history of all versions. But be careful, that is not always a good idea, for example: a news website forced to change the βslugβ for not reflecting a later corrected version. But again, maybe the slug alone shouldnβt identify an article in the first place!
- If possible, keep them short! A primary key infers that an index is needed, and with natural keysβ tendency to be alphanumeric, the index of a long text field will be less efficient than a numerical one, so go easy on the size! Plus, this can improve indexing even in our brains by making it more memorable.
Serial
This is probably the most commonly used type of identifier; most Object-Relational Mappers (ORMs) add an implicit auto-increment primary key by default. On the database side, the implementation varies, but the support is always present: some have a special field to handle that automatically, and some use sequence objects. This popularity is totally granted: serials are the most recommended type of identifier for most cases, especially on small databases when there is no need to scale horizontally.
However, when your system grows (in volume or scope), serial identifiers can be a limiting factor:
- No scalability: you rely on a single entity being responsible for the sequencing; this both creates a βsingle point of failureβ and disallows the system to be distributed at the database level.
- Lack of secrecy:: this is more evident when the identifier is visible on the UI or API; it is possible to βleakβ information about the volume of your data, for example: a competitor could see an order with the ID 1000 on your system one day, and 1100 in the next month, βleakingβ the factor that you processed 100 orders during that time. Also, if you are not careful with your permissions, this facilitates an enumeration attack.
If the drawbacks above are not an issue now or in the foreseeable future for your specific case, by all means, use it! This is a very reliable option, all implementations of it are very solid, and it is very easy to find help if needed! But mind the field size: if you choose a 32-bit integer, youβll be limited to a bit more than 4 billion (or 2 if signed). Ideally, pick a 64-bit field.
Time sortable
Also known as "Lexicographical" or "K-Sortable". This is when you encode the time of action (usually creation) into the identifier, making a sorting by that field reflect actions in a chronological order. They behave like "sparse" serial identifiers, with gaps between values growing with time.
The easiest (but also naive) approach is to rely only on time, like using a Unix timestamp. Unless you are 100% sure that your system canβt have nearly simultaneous creations, avoid using just time; also, mind the year 2038 problem.
When well implemented, lexicographical identifiers can be a perfect fit for a distributed system: values can be safely created simultaneously by many instances, assuming it relies on extra information, like some βnodeβ identification (process, machine, datacenter, etc.), or on randomness; Its adoption by major social media companies proves its suitability for distributed systems.
Today, the most popular implementations are UUIDv7 (128-bit) and Snowflakes (64-bit); pick one of those.
Random
Random identifiers are, wellβ¦ random! They disassociate completely from the content of data and its time or sequence of events. Given enough entropy, they are very distributed-friendly, and are also very secure: unlike lexicographical keys, random ones donβt even βleakβ their time of creation.
Although older than that, UUIDv4 became the most popular implementation of a random identifier in the last 10 years or so, gaining support in all main languages, frameworks, and databases. This popularity helped highlight its main weakness: randomness doesnβt play well with database indexes! Well, they βworkβ, but are less performant and take more space than the index for a serial or lexicographical field would.
If the 128-bit of a UUIDv4 is too big for your case, there are options: for example, YouTube IDs are just a random 64-bit integer, encoded in Base64, and checked if a previous value is already in use during its time of creation; this amazing video by Tom Scott shows this technique quite well.
Overall, only use random keys if you really know what you are doing and your requirements only allow for this option; otherwise, pick a lexicographical identifier instead.
Content derived
This is when we identify the data by the hash of its content; the end result looks similar to random, but is actually deterministic, granting it some interesting properties.
Probably the most familiar case of this type developers encounter daily are git commit IDs, that are just the result of the SHA-1 function (or SHA-256 as an option) being applied over the changes and its metadata.
This is a very advanced technique, not recommended for beginners; but if you are confident in your skills, identifying by hash can give a boost of performance when dealing with value types, like complex objects that happen to be immutable, thus trivially cacheable. Also, be mindful of the space needed: good hashes take a lot of space; for example, the default SHA-1 used by git takes 160 bits, and this function by itself is already considered deprecated; newer variations of SHA-2 and SHA-3 can even reach 512 bits.
Hierarchical
Also known as "Composite" or "Namespaced". This is an umbrella-term to classify multi-level identifiers combining the types described above.
Notable examples are:
- Jira-style: usually natural+serial, example:
LOG-1234
:LOG
: logistics department,1234
: ticket number inside de department.
- ISBN-13: list of serials, used for books, example:
978-0-553-38257-0
:978-0
: language "english",553
: publisher "Random House Inc.",38257
: title, in this case, "Foundation", from Isaac Asimov,0
: checksum.
- UPC code, list of serials, used for supermarked goods (barcode), example:
7622201024277
:76
: region "Switzerland",2220
: manufacturer "Milka",102427
: product "milk chocolate bar, 150g",7
: checksum.
Remember that this is context-based: the same code can be hierarchical for the system creating it, and natural for another one referencing it.
Considerations
Security
Most of the identifier types leak some data; usually this is a tolerable outcome, offset by other advantages, but if you are in a situation that you really need to audit what gets exposed, here is a list:
- Natural: no effort to hide anything.
- Serial: leaks quantity and progression.
- Time sortable: leaks time.
- Random: fully opaque.
- Content derived: fully opaque.
- Hierarchical: depends on the type of its parts.
Performance
If performance is your priority, here is a list of considerations by type:
- Natural: depends, usually poor indexing.
- Serial: very good indexing.
- Time sortable: very good indexing, sharding by time is possible.
- Random: very bad indexing.
- Content derived: very bad indexing, heavy hash operation on creation, but facilitates caching.
- Hierarchical: depends, usually good indexing, sharding by level is possible.
The reason for serial and time sortable having good indexing is that databases rely on a B-Tree structure, that works quite well with increasing numerical values (that get just appended to the end), instead of "jumping around" to add random or hashed values.
Signed / unsigned integers
This is important to check for serial and time sortable types: most databases use signed integers as their native option for those keys, and since they only use positive numbers, that means half of possible serial values, and a drastically shorter time limit for lexicographical ones. The support for unsigned values can be emulated with a βdecimalβ type in most cases, but you could be entering into βfloatβ territory if you are not careful. The lack of support for unsigned integers doesnβt necessarily need to block a database utilization, just keep in mind that the limits of identifier you picked can be lower than expected.
This is not an issue for UUIDs (internally unsigned 128-bit integers); this is so common that all popular databases have a special field to handle them.
Pro Tips
- Look for modern alternatives for legacy fields and operations, for example: PostgreSQL created Identity Columns as an alternative to Serial Types. Those usually come with performance improvements.
- If possible, test how performant an implementation is with real data before committing to it.
Implementations
UUID
UUIDs, or "Universally Unique Identifiers", are 128-bit unsigned integers, standardized in the RFC-4122 (versions 1 to 5) and extended in the RFC-9562 (versions 6 to 8). They are usually displayed as 32 hexadecimal characters formatted in 8-4-4-4-12 segments, like this: 00000000-0000-0000-0000-000000000000
, but sometimes can appear without the dashes.
Their bits follow this layout:
Bit Layout:
βββ Hexa: βββ―βββββββββββββββββββ Bits: ββββββββββββββββββββ
β .... .... β .... .... .... .... .... .... .... .... β
β .... .... β .... .... .... .... vvvv .... .... .... β
β .... .... β tt.. .... .... .... .... .... .... .... β
β .... .... β .... .... .... .... .... .... .... .... β
βββββββββββββ·ββββββββββββββββββββββββββββββββββββββββββββββ
Type: tt, version: vvvv.
Example for 0198937f-2c11-7264-a318-02dfcb75944e:
βββ Hexa: βββ―βββββββββββββββββββ Bits: ββββββββββββββββββββ
β 0198 937f β 0000 0001 1001 1000 1001 0011 0111 1111 β
β 2c11 7264 β 0010 1100 0001 0001 β0111 0010 0110 0100 β
β a318 02df ββ1010 0011 0001 1000 0000 0010 1101 1111 β
β cb75 944e β 1100 1011 0111 0101 1001 0100 0100 1110 β
βββββββββββββ·ββββββββββββββββββββββββββββββββββββββββββββββ
Type: 10, verion: 0111 (7).
The bits for the types we want are always 10
; if they are different than that, they fall outside the scope of the 2 RFCs, like the custom ones used by NCS and Microsoft (they call it a "GUID").
Versions:
- This is a train wreck! It is technically time-based, but not lexicographical.
- Used for DCE Security, proprietary, out of scope of RFC-4122.
- MD5 hash.
- Random. βοΈ
- SHA-1 hash.
- Truly lexicographical (timestamp + node).
- Truly lexicographical (timestamp + random). βοΈ
- Open to custom implementations.
Iβm only explaining the versions for completion. The only versions you should actually care about are UUIDv4 (random) and UUIDv7 (lexicographical). If you are not sure what to use, default to UUIDv7.
UUID-compatible
Since UUIDs became popular and the databases created a special field to hold them, some implementations took advantage of that to create similar formats that could be encoded as the same unsigned 128-bit integer:
Format | Comment | Example |
---|---|---|
ULID | This became popular for a while before UUIDv7 got ratified, it fits the same need of timestamp+random. It is displayed as 26 characters encoded in Crockfordβs Base32. | 01JCXSGZMZQQJ2M93WC0T8KT02 |
UPID | This is a very interesting hybrid: it starts as timestamp+random, but ends with some bits dedicated to create a 4-letter word used as a namespace, making it both lexicographical and hierarchical! Like ULID, it is also displayed encoding in Crockfordβs Base32 but with a twist: the namespace appears at the beginning. | user_2accvpp5guht4dts56je5a |
SCRU128 | Exactly the same layout as ULID (timestamp+random), but displayed in Base36. | 03cwivkme1qhj3crprqujv4lu |
Timeflake | Another ULID clone, but encoded in Base62. | 02i2XhN7hAuaFh3MwztcMd |
TypeID | Used by Jetify; it is another ULID clone, but it is displayed in Base32 with a prefix that is not stored in the 128 bits; it needs to be stored in another column to recreate the rendered version. | prefix_01h2xcejqtf2nbrexx3vqjhp41 but prefix_ is not encoded. |
Flake | Used by BMC TrueSight Pulse (formerly Boundary); It is lexicographical (timestamp+node+sequence), and gets displayed in Base62. | 8HFaR8qWtRlGDHnO57 |
Datadog Trace ID | This is basically the Datadog developers using only the bits they need from the 128 available: it starts with 16 bits of timestamp, followed by 16 bits of zeroes, and finally 32 bits of randomness. It just gets displayed in hex, like normal UUIDs. | 6772800700000000d97a8af26532e259 |
Other than those, there are many projects with the same goal of rendering normal UUIDs with fewer characters. They all have their own combinations of encoding, alphabets, and rules. Looks like everyone is "rolling your own"!
Snowflake variants
This is a Lexicographical 64-bit integer open-sourced by Twitter (now X) in 2010; since then, many other companies and open projects started their own variations based on it.
It follows this structure:
- Timestamp: to build it, you'll need:
- Bit size: usually around 42 bits, but varies by version.
- Time "tick": usually measured in 1 millisecond, but other units could be used, like 10ms for Sony and 1s for Frostflake.
- Epoch: the Unix timestamp of the zero value. Normally each implementation picks a round-ish time near the beginning of their system; the Unix timestamp itself (
1970-01-01
) can sometimes be used, but that means wasting 55+ years that could limit the lifetime of the ID. - Zero bit (optional): some databases (like PostgreSQL and SQLite) only support signed 64-bit integers, that means reserving the first digit the sign, making only 63 bits usable for the rest of the Snowflake; that's why some reserve the first bit as always zero.
- Extra fields: usually has those elements (order and existence vary):
- Some kind of segmentation: this is how distributed systems avoid collisions for things that happen at different places in the exact timestamp. Names like "worker", "node", "generator" and "datacenter" are common.
- Sequence: another level of collision avoidance; if things happen fast enough in the same place during the same time "tick", this values gets incremented.
- Random: this is rare, the only example I could find of this was the TSID implemmentation.
Here are a few examples of how the variations layout their bits:
Twitter:
0 00000000000000000000000000000000000000000 0000000000 000000000000
x Timestamp ticks Worker Sequence
Frostflake:
00000000000000000000000000000000 000000000000000000000 00000000000
Timestamp ticks Sequence Generator
TSID:
000000000000000000000000000000000000000000 0000000000000000000000
Timestamp ticks Random
Implementation example:
Assuming a Twitter ID like 1777150623882019211:
In binary: 0 00110001010100110110101000100111010000011 0110100000 000110001011
| | | |
| | | Sequence: 395
| | Worker: 416
| Timestamp ticks: 423705726595
Always zero
Timestamp: (423705726595 * 1) + 1288834974657 = 1712540701252
| | | |
| | | Time: 2024-04-08T01:45:01.252Z
| | Epoch
| Tick duration: 1ms
Ticks since epoch
I was able to find all this versions "in the wild":
Variant | Expiration | Unix epoch (ms) | Time "tick" | Time bits | Extra's bits |
---|---|---|---|---|---|
2080-07-10 | 1288834974657 | 1ms | 1 + 41 | Worker: 10 Sequence: 12 | |
Discord | 2154-05-15 2084-09-06 (signed) | 1420070400000 | 1ms | 42 | Worker: 5 Process: 5 Sequence: 12 |
Mastodon | 10889-08-02 (!!!)6429-10-17 (signed) | 0 | 1ms | 48 | Sequence: 16 |
Meta (Instagram and Threads) | 2081-04-30 2046-06-27 (signed) | 1314220021721 | 1ms | 41 | Shard: 13 Sequence: 10 |
2039-09-07 | 0 | 1ms | 1 + 41 | Worker: 10 Sequence: 12 | |
Sony | 2188-11-16 | 1409529600000 | 10ms | 1 + 39 | Sequence: 8 Machine: 8 |
Spaceflake | 2084-09-06 | 1420070400000 | 1ms | 1 + 41 | Node: 5 Worker: 5 Sequence: 12 |
Frostflake | 2106-02-07 2038-01-19 (signed) | 0 | 1s | 32 | Sequence: 21 Generator: 11 |
Flake ID | 2109-05-15 2039-09-07 (signed) | 0 | 1ms | 42 | Datacenter: 5 Worker: 5 Ssequence: 12 |
TID (AT Protocol, Bluesky) | 2255-06-05 | 0 | 1s | 1 + 53 | Clock: 10 |
SCRU64 | 4261-02-27 | 0 | 256ms | 2 + 38 | Node: 24 |
TSID | 2159-05-15 2089-09-06 (signed) | 1577836800000 | 1ms | 42 | Random: 22 |
Some variations have the option of an alternative non-numerical representation, but those are just the same 64-bit integers encoded in various forms, usually to look short on URLs:
Variation | Characters | Encoding | Example |
---|---|---|---|
TID (Bluesky) | Fixed 13 | Base32 | 3lfegaoywdk2w |
Threads (Meta) | Up to 11 | Base64 | DEr_fXvuw6D |
Frostflake | Up to 11 | Base58 | JERHwh5PXjL |
SCRU64 | Fixed 12 | Base36 | 0v20wcjrb21p |
TSID | Fixed 13 | Base32 (Crockford's) | 0J4AEXRN106Z0 |
Recommendations:
- To make things simpler, I recommend the TSID approach (timestamp + random), this is kinda of a mini version of an UUIDv7.
- If you need to have a note + sequence, just pick what is available for your language, or just "roll your own" (it is not that hard).
- Whatever the pattern, try to pick an epoch that fits your case.
- Mind your database numeric type, if it only allows for signed integers, fine tune the bits and "tick" to make it last.
- Never change your implementation once picked.
Other
UUIDs and Snowflakes are not silver bullets; there will always be cases when you need to create your own type of identifier to fit your exclusive needs. Here is a list of formats like that, where the creators were kind enough to share their implementations:
Format | Type | Backed by | Encoding | Example |
---|---|---|---|---|
Xid | Time sortable | Integer: 96 bits | Base32hex | cst4p962941gd9baqg70 |
ObjectId (MongoDB) | Time sortable | Integer: 96 bits | Hex | 6592008029c8c3e4dc76256c |
KSUID | Time sortable | Integer: 160 bits | Base64 | 1HCpXwx2EK9oYluWbacgeCnFcLf |
CUID1 | Time sortable | ASCII: 25 chars | Base36 | cm3xemk9o00070cm7ghnl6toe |
Puid | Time sortable | ASCII: 12, 14 or 24 chars | Base36 | he5fps6l2504cd1w3ag8ut8e |
PushID (Firebase) | Time sortable | ASCII: 20 chars | Base64 | -OFrJ24CPTXLcIPPjvh3 |
Sqids / Hashids (old) | Hierarchical: list of integers | ASCII: variable | Custom | 86Rf07xd4z / 80JTEquWr |
Stripe ID | Hierarchical: natural + random | ASCII: variable | Custrom | cus_lO1DEQWBbQAACfHO |
NUID (NATS) | Hierarchical: random + serial | ASCII: 22 chars | Base62 | EQyuCsA4ysv7ezXReOrk4i |
CUID2 | Random | ASCII: 24 chars | Base36 | byab6ewccgwheoshq1wk9hds |
Nano ID | Random | ASCII: usually 21 chars | Custom | XBCdxzsCR2FEFeSwhnjCo |
Breeze ID | Random | ASCII: 19 chars | Custom | 9NU6-XQLZ-BDIH-6HKE |
Alphadec | Time sortable | Time | Custom | 2025_L0V3_000000 |
Conclusion
As you can see, this subject is very deceiving: it looks simple and well-established, but is actually very deep, with lots of nuance to consider! But, if you want a very short answer, here is the abridged version:
- If you have a candidate for a natural key, make sure it is unique, universal and stable; if you are not sure, pick other type.
- If you are sure your system will never have a need to be distributed, pick a serial; they are basically βset and forgetβ.
- If you know or suspect your system may need to be distributed, pick one of the time sortable options. UUIDv7 is the most straightforward one, but, if you trust your skills and want to save some space, pick one of the Snowflake variants or use them as a basis for a custom variant of your own.
- If you are in a situation that you canβt even leak the creation time of your rows (be reasonable, not paranoid), then pick a random identifier, like UUIDv4.
- If you have a case of complex immutable objects, consider the content derived (hash) approach.
- If you have to integrate many systems or are creating an industry standard: first: try not to reinvent the wheel, look for existing options; if you still need a custom case, consider creating a hierarchical identifier.
I hope this article has been useful. This kind of topic is usually not deeply discussed academically, and in the industry it is common to just go for the familiar patterns. Now you can make a thoughtful and considerate decision next time you model your data.