Thread: Sequence vs UUID

Sequence vs UUID

From
veem v
Date:

Hello. We are trying to understand whether we should use UUIDs or sequences for primary keys in general. Many of the blogs I have seen over the internet (one is below), covering multiple databases, mostly state that sequences are better than UUIDs. But I think that in concurrent data load scenarios a UUID will spread the contention point, whereas a sequence can be a single point of contention.

So we want to understand from the experts here whether there are any clear rules, or a pros-and-cons list for each, to identify the exact scenarios in which we should go for one over the other. Basically, I wanted to see if we can run some tests on sample data to measure the percentage overhead on read and write performance with UUID vs sequence and draw a general conclusion. And considering that open source Postgres is the base for many databases like Redshift, results that apply to Postgres should apply to the others as well.

https://www.percona.com/blog/2019/11/22/uuids-are-popular-but-bad-for-performance-lets-discuss/

Re: Sequence vs UUID

From
Christophe Pettus
Date:

> On Jan 26, 2023, at 11:17, veem v <veema0000@gmail.com> wrote:
> So we want to understand from the experts here whether there are any clear rules, or a pros-and-cons
> list for each, to identify the exact scenarios in which we should go for one over the other?

Clear rules are a bit difficult to come by here, but there are general guidelines.

First, the distinction isn't exactly UUIDs vs sequences.  There are two distinctions:

1. UUIDs vs bigints.
2. Sequential values vs random values.

The advantage of bigints vs UUIDs is that bigints will be faster for PostgreSQL to process in a variety of ways, as well
as being half the size of a UUID (so, smaller tables, smaller indexes, etc.).  The main advantage of UUIDs vs bigints
is, if you are using random values, that there's existing easy-to-use infrastructure for generating UUIDs vs generating
random bigints.
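A minimal DDL sketch of the two combinations (an illustration, assuming PostgreSQL 13+ for the built-in gen_random_uuid(); table and column names are made up):

-- bigint key drawn from an implicit sequence
CREATE TABLE account_bigint (
    id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name text
);

-- random 16-byte uuid key generated server-side
CREATE TABLE account_uuid (
    id   uuid DEFAULT gen_random_uuid() PRIMARY KEY,
    name text
);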

The advantage of sequential values is that they interact much better with caching than random values.  Random values
will have a harder time maintaining a reasonable in-memory working set than sequential values.  (Of course, if the
database fits entirely in memory, this isn't as much of an issue.)  The advantage of random values is that they are more
opaque; if there is a 123480102 in a sequential key, an attacker can be confident there's also a 123480103, which a
random value avoids.  There are algorithms for generating sequential values that avoid this by having hard-to-guess less
significant digits.

Another advantage of sequential values is that they are (roughly) time-ordered, so they can be used to get "most
recent" efficiently.

One concern about sequential values that generally is not a real issue is the bottleneck of creating new sequential
values. The sequence functionality in PostgreSQL is very concurrency-friendly. 
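If sequence traffic itself ever did show up as a hot spot, per-session caching is the available knob (a sketch; note that cached values unused at session end become gaps):

CREATE SEQUENCE order_id_seq CACHE 1024;  -- each session pre-allocates 1024 values
SELECT nextval('order_id_seq');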

UUIDs can be generated in such a way that they have sequential properties; see:

    https://github.com/tvondra/sequential-uuids





Re: Sequence vs UUID

From
Merlin Moncure
Date:
On Thu, Jan 26, 2023 at 1:18 PM veem v <veema0000@gmail.com> wrote:

Hello. We are trying to understand whether we should use UUIDs or sequences for primary keys in general. [...]

* UUIDs are big for an identifier (16 bytes); a sequence is 4 or 8
* sequences are NOT a contention point; the database uses some tricks to work around that
* UUIDs are basically random data, causing page fragmentation.  This is particularly bad in auto-clustering architectures like SQL Server
* Also, UUIDs can deliver very poor buffer hit ratios when sweeping ranges of records on large tables.

merlin

 

Re: Sequence vs UUID

From
Rob Sargent
Date:
On 1/26/23 14:36, Merlin Moncure wrote:

* UUIDs are big for an identifier (16 bytes); a sequence is 4 or 8
* sequences are NOT a contention point; the database uses some tricks to work around that
* UUIDs are basically random data, causing page fragmentation.  This is particularly bad in auto-clustering architectures like SQL Server
* Also, UUIDs can deliver very poor buffer hit ratios when sweeping ranges of records on large tables.

merlin

 
"Sweeping ranges of records" by ID suggests you have information in your id.  If you're  comfortable with sequence generation as a surrogate for time-point of entry, by all means have at it.

For many of us, contemporaneously generated records have nothing to do with each other. (And they carry attributes which group them.)

UUIDs do a very good job of avoiding id-overlap across domains (compare to a sequence for each domain).



Re: Sequence vs UUID

From
Erik Wienhold
Date:
> On 26/01/2023 20:17 CET veem v <veema0000@gmail.com> wrote:
>
> Hello. We are trying to understand whether we should use UUIDs or sequences for primary keys in
> general. [...]

I think that[1] provides a good summary.  Performance consideration is just one
aspect.  Is there a technical requirement for using UUID over sequential values?

If there's a single generator of primary keys use bigint sequences.  In case of
multiple generators (multi-master replication, sharding, clients generating IDs)
consider UUID.

There are arguments against sequential PK, e.g. they give away too much info and
allow attacks such as forced browsing[2].  The first I can understand: you may
not want to reveal the number of users or customers.  But access control should
prevent forced browsing.

[1] https://www.cybertec-postgresql.com/en/uuid-serial-or-identity-columns-for-postgresql-auto-generated-primary-keys/
[2] https://owasp.org/www-community/attacks/Forced_browsing

--
Erik



Re: Sequence vs UUID

From
Benedict Holland
Date:
You could always create a UUID matching table for anything displayed to users and keep a private ID for anything internal. From my practical standpoint, one is 8 bytes, the other is 16 or 32. Any database implementation should guarantee a unique value. I have had cases where it didn't work, but those were rare and possibly a coding error. Basically, I don't see a practical difference apart from URL access and giving away a PK. I haven't had an issue with it. Some people have. I still use them internally, but my requirements are not public facing.
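A minimal sketch of that matching-table pattern (names are hypothetical): the bigint key stays internal, and only the random UUID is exposed in URLs or APIs.

CREATE TABLE customer (
    id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- internal, never exposed
    public_id uuid NOT NULL DEFAULT gen_random_uuid() UNIQUE,   -- external handle
    name      text
);

-- the application resolves the public handle to the internal key ($1 bound by the app)
SELECT id FROM customer WHERE public_id = $1;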

Thanks,
Ben



On Thu, Jan 26, 2023, 4:55 PM Erik Wienhold <ewie@ewie.name> wrote:
> On 26/01/2023 20:17 CET veem v <veema0000@gmail.com> wrote:
>
> Hello. We are trying to understand whether we should use UUIDs or sequences for primary keys in
> general. [...]

I think that[1] provides a good summary.  Performance consideration is just one
aspect.  Is there a technical requirement for using UUID over sequential values?

If there's a single generator of primary keys use bigint sequences.  In case of
multiple generators (multi-master replication, sharding, clients generating IDs)
consider UUID.

There are arguments against sequential PK, e.g. they give away too much info and
allow attacks such as forced browsing[2].  The first I can understand: you may
not want to reveal the number of users or customers.  But access control should
prevent forced browsing.

[1] https://www.cybertec-postgresql.com/en/uuid-serial-or-identity-columns-for-postgresql-auto-generated-primary-keys/
[2] https://owasp.org/www-community/attacks/Forced_browsing

--
Erik


Re: Sequence vs UUID

From
Merlin Moncure
Date:
On Thu, Jan 26, 2023 at 3:50 PM Rob Sargent <robjsargent@gmail.com> wrote:
On 1/26/23 14:36, Merlin Moncure wrote:

* UUIDs are big for an identifier (16 bytes); a sequence is 4 or 8
* sequences are NOT a contention point; the database uses some tricks to work around that
* UUIDs are basically random data, causing page fragmentation.  This is particularly bad in auto-clustering architectures like SQL Server
* Also, UUIDs can deliver very poor buffer hit ratios when sweeping ranges of records on large tables.

"Sweeping ranges of records" by ID suggests you have information in your id.  If you're  comfortable with sequence generation as a surrogate for time-point of entry, by all means have at it.

Yeah, sequences preserve time locality.  Clustering the table on the primary key would then optimize certain cases.  Clustering is useless on GUIDs.  Sequences do better here, and hybrid natural key models can do better still.  Buffer hit ratio and read amplification can be major issues if not understood and controlled for.
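A sketch of the clustering idea (table and index names are hypothetical; CLUSTER is a one-time physical reorder, not maintained on future inserts):

-- rewrite the heap in primary key order so range scans touch fewer pages
CLUSTER test1_bigint USING test1_bigint_pkey;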
 
For many of us, contemporaneously generated records have nothing to do with each other. (And they carry attributes which group them.)

UUIDs do a very good job of avoiding id-overlap across domains (compare to a sequence for each domain).

While true, there are other solutions to the issue; one or more of:
* separating id generation from storage
* reserving id ranges (see the sketch below)
* creating domain id composite alongside generated id

Relying solely on a GUID to generate uniqueness seems like a broken model anyway in the general case.  Whatever thing is responsible for suppressing information conflicts ought to be emitting the id.
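A sketch of the id-range reservation idea from the list above (ranges are arbitrary): each generator draws from a disjoint slice of the bigint space, so values never collide.

-- node A and node B own non-overlapping ranges
CREATE SEQUENCE node_a_id_seq START 1          MAXVALUE 999999999;
CREATE SEQUENCE node_b_id_seq START 1000000000 MAXVALUE 1999999999;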

merlin

Re: Sequence vs UUID

From
Ron
Date:
On 1/26/23 15:55, Erik Wienhold wrote:
>> On 26/01/2023 20:17 CET veem v <veema0000@gmail.com> wrote:
>>
>> Hello. We are trying to understand whether we should use UUIDs or sequences for primary keys in
>> general. [...]
> I think that[1] provides a good summary.  Performance consideration is just one
> aspect.  Is there a technical requirement for using UUID over sequential values?
>
> If there's a single generator of primary keys use bigint sequences.  In case of
> multiple generators (multi-master replication, sharding, clients generating IDs)
> consider UUID.
>
> There are arguments against sequential PK, e.g. they give away too much info and
> allow attacks such as forced browsing[2].  The first I can understand: you may
> not want to reveal the number of users or customers.  But access control should
> prevent forced browsing.

Shouldn't your application layer isolate the users from the database?  UUIDs 
are all over the DBs I manage, but the PKs are all sequences.

-- 
Born in Arizona, moved to Babylonia.



Re: Sequence vs UUID

From
G Hasse
Date:
Hello.

I have been using UUIDs for quite a long time now. The reason I began to use UUIDs was the need to be able to move data
between databases and the need to create records outside the database. You should use a UUID as the primary key for a
record and also have some bookkeeping UUIDs in the record, like the "origin" of the record. In this way it is "easy"
to handle different sources of data.

We have also written some nice replicating software on this basis (postsync) that can check for alterations in one
database and update others. In this way we can keep one or many replicas of databases.

So forget about performance issues (there will ALWAYS be a need for faster systems). The ease and functionality with
UUIDs is so much better. Sequence keys are a terrible idea!

// GH


On 2023-01-26 at 20:17, veem v wrote:
> Hello. We are trying to understand whether we should use UUIDs or sequences for primary keys in general. [...]

-- 
gorhas@raditex.nu
http://www.raditex.nu
Raditex Control AB
Bo 229, 715 91 ODENSBACKEN
Mob: 070-5530148



Re: Sequence vs UUID

From
Rob Sargent
Date:
> So forget about performance issues (there will ALWAYS be a need for faster systems). The ease and functionality with
> UUIDs is so much better. Sequence keys are a terrible idea!
>
> // GH

Wow. I am not alone.



Re: Sequence vs UUID

From
Erik Wienhold
Date:
> On 27/01/2023 01:48 CET Ron <ronljohnsonjr@gmail.com> wrote:
>
> On 1/26/23 15:55, Erik Wienhold wrote:
> >
> > There are arguments against sequential PK, e.g. they give away too much info and
> > allow attacks such as forced browsing[2].  The first I can understand: you may
> > not want to reveal the number of users or customers.  But access control should
> > prevent forced browsing.
>
> Shouldn't your application layer isolate the users from the database?  UUIDs
> are all over the DBs I manage, but the PKs are all sequences.

Yes, I meant the application layer, not Postgres' access control.

--
Erik



Re: Sequence vs UUID

From
Benedict Holland
Date:
Why is it a terrible idea? I have been using them for years without a single problem. I don't rely on them for creation order. "Terrible" seems a bit extreme.

Thanks,
Ben



Re: Sequence vs UUID

From
Ron
Date:
Type 4 UUIDs are sub-optimal for big tables because cache hit rates drop through the floor.

This lesson was burned into my psyche waaaay back in the Clinton administration.  It was my task to speed up a five hour batch job which read input records from a flat file, did some validations and then inserted them.  Simply sorting the input file on the primary key fields -- we used natural keys, not synthetics -- dropped the run time to two hours.  (VMS SORT saved the day, because you could tell it the sort order you wanted; thus, I could preserve the header record at the top of the file, and the trailer record at the end of the file without jumping through a bunch of hoops.)

On 1/28/23 19:44, Benedict Holland wrote:
Why is it a terrible idea? I have been using them for years without a single problem. I don't rely on them for creation order. "Terrible" seems a bit extreme.

Thanks,
Ben




--
Born in Arizona, moved to Babylonia.

Re: Sequence vs UUID

From
Miles Elam
Date:
On Sat, Jan 28, 2023 at 6:02 PM Ron <ronljohnsonjr@gmail.com> wrote:
>
> Type 4 UUIDs are sub-optimal for big tables because cache hit rates drop through the floor.
>
> This lesson was burned into my psyche waaaay back in the Clinton administration.  It was my task to speed up a five
> hour batch job which read input records from a flat file, did some validations and then inserted them.  Simply sorting
> the input file on the primary key fields -- we used natural keys, not synthetics -- dropped the run time to two hours.
> (VMS SORT saved the day, because you could tell it the sort order you wanted; thus, I could preserve the header record
> at the top of the file, and the trailer record at the end of the file without jumping through a bunch of hoops.)

This can be mitigated with judicious use of a sequence at the front of
the uuidv4.

    https://www.2ndquadrant.com/en/blog/sequential-uuid-generators/

More effort than just calling the built-in gen_random_uuid() or
equivalent in app code, but a substantial performance gain for your
effort.

    https://github.com/tvondra/sequential-uuids

And in a managed environment where you can't install custom
extensions, a fairly simple function with divide on unix epoch seconds
combined with a call to overlay(...) should suffice performance-wise.
At 60 seconds, this will loop every 45 days or so, and you can choose
how much "sequentialness" works for you, from 1 to 4 bytes at the
expense of pseudo-randomness.

-----------------------------

-- Generate time interval UUID
CREATE OR REPLACE FUNCTION gen_interval_uuid(interval_seconds int4 = 60, block_num_bytes int2 = 2)
RETURNS uuid LANGUAGE sql VOLATILE PARALLEL SAFE AS $$
  SELECT encode(
    overlay(
      -- convert the uuid to byte array
      uuid_send(gen_random_uuid())
      -- use only the bottom bytes
      PLACING substring(
          int4send((extract(epoch FROM now()) / interval_seconds)::int4)
          FROM (5 - block_num_bytes)
      )
      -- place at the front two bytes of the uuid
      FROM 1
    )
    -- convert the resulting byte array to hex for conversion to uuid
    , 'hex')::uuid
  WHERE interval_seconds > 0 AND block_num_bytes BETWEEN 1 AND 4
$$;

-----------------------------
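A usage sketch (assuming the function above has been created; it returns NULL if the guard WHERE clause fails):

SELECT gen_interval_uuid();             -- defaults: 60 s buckets, 2-byte time prefix (~45-day cycle)
SELECT gen_interval_uuid(300, 4::int2); -- 5-minute buckets, full 4-byte prefix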

Technically this deviates from the v4 spec. You could always convert it to a
UUIDv8, which is the intention behind that new version, even though the
standard hasn't been ratified yet.


Cheers,

Miles Elam



Re: Sequence vs UUID

From
Ron
Date:
Then it's not a Type 4 UUID, which is perfectly fine; just not random.

Also, should now() be replaced by clock_timestamp(), so that it can be 
called multiple times in the same transaction?
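For reference, a small demo of the difference between the two calls (both are core Postgres functions):

BEGIN;
SELECT now(), clock_timestamp();  -- now() is frozen at transaction start
SELECT pg_sleep(1);
SELECT now(), clock_timestamp();  -- now() unchanged; clock_timestamp() ~1 s later
COMMIT;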

On 1/28/23 21:28, Miles Elam wrote:
> On Sat, Jan 28, 2023 at 6:02 PM Ron <ronljohnsonjr@gmail.com> wrote:
>> Type 4 UUIDs are sub-optimal for big tables because cache hit rates drop through the floor.
>> [...]
 
> This can be mitigated with judicious use of a sequence at the front of
> the uuidv4.
>
>      https://www.2ndquadrant.com/en/blog/sequential-uuid-generators/
>
> More effort than just calling the built-in gen_random_uuid() or
> equivalent in app code, but a substantial performance gain for your
> effort.
>
>      https://github.com/tvondra/sequential-uuids
>
> And in a managed environment where you can't install custom
> extensions, a fairly simple function with divide on unix epoch seconds
> combined with a call to overlay(...) should suffice performance-wise.
> At 60 seconds, this will loop every 45 days or so, and you can choose
> how much "sequentialness" works for you, from 1 to 4 bytes at the
> expense of pseudo-randomness.

-- 
Born in Arizona, moved to Babylonia.



Re: Sequence vs UUID

From
Miles Elam
Date:
On Sat, Jan 28, 2023 at 8:02 PM Ron <ronljohnsonjr@gmail.com> wrote:
>
> Then it's not a Type 4 UUID, which is perfectly fine; just not random.

Yep, which is why it really should be re-versioned to UUIDv8 to be
pedantic. In everyday use, though, it almost certainly doesn't matter.

> Also, should now() be replaced by clock_timestamp(), so that it can be
> called multiple times in the same transaction?

Not necessary. Instead of 122 bits of entropy, you get 106 bits of
entropy and a new incremented prefix every minute. now() vs
clock_timestamp() wouldn't make a substantive difference. Should still
be reasonably safe against the birthday paradox for more than a
century when creating more than a million UUIDs per second.



Re: Sequence vs UUID

From
veem v
Date:
I have a question. As I understand it, usage-wise there are multiple benefits of UUIDs over sequences: for example, a distributed app where we may not be able to rely on a single-point generator like a sequence, a multi-master architecture, or sharding.

If we look purely at performance, the key advantage of a sequence is its smaller storage size: more index rows fit in cache, which benefits read queries and should also benefit joins. Fetching a value from a sequence is also cheaper than calculating a UUID. The downside is during write operations: it can be a point of contention during concurrent data loads, as every incoming request will try to modify the same table/index page/block. But as mentioned in this blog (https://www.2ndquadrant.com/en/blog/sequential-uuid-generators/), a UUID can be made sequential, e.g. using a timestamp prefix. However, isn't making the UUID sequential actually a disadvantage again, and a contention point for the unique index? Each incoming write will now fight for the same block/page during a concurrent data load, contending for the same table block or one side of the index branch/leaf blocks, whereas with random UUIDs the writes spread across multiple blocks, so there was no contention on any specific block. Please correct me if my understanding is wrong.





Re: Sequence vs UUID

From
veem v
Date:
I was trying to test the performance of simple reads and writes for bigint vs UUID. What we see is ~3x performance degradation when joining on UUID columns vs bigint columns. Even just generating a UUID vs a sequence value is ~3x slower. And insert performance on the same table for ~10 million rows is ~1 min 39 sec for bigint vs ~3 min 11 sec for UUID. Is this extent of performance degradation expected for UUID?

CREATE TABLE test1_UUID ( id bigint, source_id varchar(36) PRIMARY KEY, Name varchar(20) );
CREATE TABLE test2_UUID ( id bigint, source_id varchar(36) PRIMARY KEY, Name varchar(20) );

CREATE TABLE test1_bigint ( id bigint PRIMARY KEY, source_id varchar(36), Name varchar(20) );
CREATE TABLE test2_bigint ( id bigint PRIMARY KEY, source_id varchar(36), Name varchar(20) );

Loaded the same 10 million rows into each.

explain Analyze select * from test1_bigint a , test2_bigint b where a.id = b.id

Merge Join (cost=12.31..875534.52 rows=10000021 width=100) (actual time=0.042..6974.575 rows=10000000 loops=1)
  Merge Cond: (a.id = b.id)
  -> Index Scan using test1_bigint_pkey on test1_bigint a (cost=0.43..362780.75 rows=10000021 width=50) (actual time=0.020..2070.079 rows=10000000 loops=1)
  -> Index Scan using test2_bigint_2_pkey on test2_bigint b (cost=0.43..362780.75 rows=10000021 width=50) (actual time=0.019..2131.086 rows=10000000 loops=1)
Planning Time: 0.207 ms
Execution Time: 7311.210 ms

set enable_seqscan=off;

explain Analyze select * from test1_UUID a , test2_UUID b where a.source_id = b.source_id;

Merge Join (cost=2.75..2022857.05 rows=10000021 width=100) (actual time=0.043..21954.213 rows=10000000 loops=1)
  Merge Cond: ((a.source_id)::text = (b.source_id)::text)
  -> Index Scan using test1_uuid_pkey on test1_UUID a (cost=0.56..936420.18 rows=10000021 width=50) (actual time=0.022..7854.143 rows=10000000 loops=1)
  -> Index Scan using test2_uuid_2_pkey on test2_UUID b (cost=0.56..936437.90 rows=10000021 width=50) (actual time=0.017..7971.187 rows=10000000 loops=1)
Planning Time: 0.516 ms
Execution Time: 22292.801 ms

**********

create sequence myseq cache 32767;

select count(nextval('myseq') ) from generate_series(1,10000000)
 1 row retrieved starting from 1 in 4 s 521 ms (execution: 4 s 502 ms, fetching: 19 ms)

 select count(gen_random_uuid()) from generate_series(1,10000000)
 1 row retrieved starting from 1 in 11 s 145 ms (execution: 11 s 128 ms, fetching: 17 ms)





Re: Sequence vs UUID

From
Dominique Devienne
Date:
On Mon, Jan 30, 2023 at 5:11 PM veem v <veema0000@gmail.com> wrote:
> CREATE TABLE test1_UUID ( id bigint,source_id varchar(36) PRIMARY KEY, Name varchar(20) );

Maybe if you used a "native" 16-byte uuid, instead of its textual
serialization with dashes (36 bytes + length overhead), the gap would
narrow.
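A sketch of the suggested change against the test schema above (the table name here is made up):

-- native 16-byte uuid instead of varchar(36); compared bytewise, no collation involved
CREATE TABLE test1_uuid_native ( id bigint, source_id uuid PRIMARY KEY, name varchar(20) );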



Re: Sequence vs UUID

From
Tom Lane
Date:
Dominique Devienne <ddevienne@gmail.com> writes:
> On Mon, Jan 30, 2023 at 5:11 PM veem v <veema0000@gmail.com> wrote:
>> CREATE TABLE test1_UUID ( id bigint,source_id varchar(36) PRIMARY KEY, Name varchar(20) );

> Maybe if you used a "native" 16-byte uuid, instead of its textual
> serialization with dashes (36 bytes + length overhead), the gap would
> narrow.

Yeah, especially if your database is not using C locale.  The
strcoll or ICU-based comparisons done on string types can be
enormously more expensive than the memcmp() used for binary
types like native uuid.

            regards, tom lane
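If the key had to stay text for some reason, a per-column collation override is one way to get memcmp-style comparisons (a sketch; switching to the uuid type remains the better fix):

-- COLLATE "C" compares byte by byte, bypassing strcoll/ICU
CREATE TABLE test1_text_c ( source_id varchar(36) COLLATE "C" PRIMARY KEY, name varchar(20) );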



Re: Sequence vs UUID

From
veem v
Date:
Thank you so much for the details. I am a bit new to Postgres, and these test results were from a dev system. If I understand correctly, do you mean the settings you mentioned (usage of C locale, or a "native" 16-byte uuid) should be present in a production system, and thus we should test the performance of UUID vs sequence on a similar setup? Or, if this sort of degradation in UUID performance is not expected, how do we get these settings turned on so as to see the best string-type or UUID performance? Can you please guide me here?

On Mon, 30 Jan 2023 at 22:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Dominique Devienne <ddevienne@gmail.com> writes:
> On Mon, Jan 30, 2023 at 5:11 PM veem v <veema0000@gmail.com> wrote:
>> CREATE TABLE test1_UUID ( id bigint,source_id varchar(36) PRIMARY KEY, Name varchar(20) );

> Maybe if you used a "native" 16-byte uuid, instead of its textual
> serialization with dashes (36 bytes + length overhead), the gap would
> narrow.

Yeah, especially if your database is not using C locale.  The
strcoll or ICU-based comparisons done on string types can be
enormously more expensive than the memcmp() used for binary
types like native uuid.

                        regards, tom lane

Re: Sequence vs UUID

From
Adrian Klaver
Date:
On 1/30/23 11:43, veem v wrote:
> Thank you so much for the details. I am a bit new to Postgres, and these
> test results were from a dev system. If I understand correctly, do you mean
> the settings you mentioned (usage of C locale, or a "native" 16-byte uuid)
> should be present in a production system, and thus we should test the
> performance of UUID vs sequence on a similar setup? Or, if this sort of
> degradation in UUID performance is not expected, how do we get these settings
> turned on so as to see the best string-type or UUID performance? Can you
> please guide me here?

No what is being said is change:

source_id varchar(36)

to

source_id uuid

as in:

https://www.postgresql.org/docs/current/datatype-uuid.html


-- 
Adrian Klaver
adrian.klaver@aklaver.com




Re: Sequence vs UUID

From
Ron
Date:
And populate that column with UUIDs generated by the gen_random_uuid() function.

(Requires v13.)

On 1/30/23 13:46, Adrian Klaver wrote:
> On 1/30/23 11:43, veem v wrote:
>> [...]
>
> No what is being said is change:
>
> source_id varchar(36)
>
> to
>
> source_id uuid
>
> as i:
>
> https://www.postgresql.org/docs/current/datatype-uuid.html
>

-- 
Born in Arizona, moved to Babylonia.



Re: Sequence vs UUID

From
veem v
Date:
I tried a quick test on dbfiddle, shown below, with uuid as the data type; in each of the cases below, UUID performance seems drastically reduced compared to sequence performance. Let me know if anything is wrong in my testing here.

1) Sequence generation vs UUID generation: execution time increased from ~291 ms to ~5655 ms.
2) Insert performance, sequence vs UUID: execution time increased from ~2031 ms to ~10599 ms.
3) Index performance, sequence vs UUID: execution time increased from ~0.3 ms to ~0.5 ms.


 PostgreSQL 15.0 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-10), 64-bit

CREATE UNLOGGED TABLE test_bigint ( id bigint PRIMARY KEY );
CREATE UNLOGGED TABLE test_uuid ( id uuid PRIMARY KEY );
create sequence myseq cache 32767;

************ sequence generation vs UUID generation Test**************
explain analyze  select count(nextval('myseq') ) from generate_series(1,100000);

QUERY PLAN
Aggregate  (cost=1500.00..1500.01 rows=1 width=8) (actual time=291.030..291.030 rows=1 loops=1)
  ->  Function Scan on generate_series  (cost=0.00..1000.00 rows=100000 width=0) (actual time=53.332..63.941 rows=100000 loops=1)
Planning Time: 0.155 ms
Execution Time: 291.719 ms

explain analyze  select count(gen_random_uuid()) from generate_series(1,100000);

QUERY PLAN
Aggregate  (cost=1500.00..1500.01 rows=1 width=8) (actual time=5654.453..5654.454 rows=1 loops=1)
  ->  Function Scan on generate_series  (cost=0.00..1000.00 rows=100000 width=0) (actual time=84.328..514.214 rows=100000 loops=1)
Planning Time: 0.082 ms
Execution Time: 5655.158 ms

*************** Insert Test***************

explain analyze INSERT INTO test_bigint select nextval('myseq')  from generate_series(1,100000);

QUERY PLAN
Insert on test_bigint  (cost=0.00..2250.00 rows=0 width=0) (actual time=2030.960..2030.961 rows=0 loops=1)
  ->  Function Scan on generate_series  (cost=0.00..1250.00 rows=100000 width=8) (actual time=48.102..636.311 rows=100000 loops=1)
Planning Time: 0.065 ms
Execution Time: 2031.469 ms

explain analyze  INSERT INTO test_uuid  select gen_random_uuid()  from generate_series(1,100000);


QUERY PLAN
Insert on test_uuid  (cost=0.00..2250.00 rows=0 width=0) (actual time=10599.230..10599.230 rows=0 loops=1)
  ->  Function Scan on generate_series  (cost=0.00..1250.00 rows=100000 width=16) (actual time=36.975..6289.811 rows=100000 loops=1)
Planning Time: 0.056 ms
Execution Time: 10599.805 ms

************ Index performance********

Explain analyze select * from test_bigint where id in (select id from test_bigint limit 100);

QUERY PLAN
Nested Loop  (cost=2.98..734.71 rows=100 width=8) (actual time=0.083..0.269 rows=100 loops=1)
  ->  HashAggregate  (cost=2.69..3.69 rows=100 width=8) (actual time=0.046..0.062 rows=100 loops=1)
        Group Key: test_bigint_1.id
        Batches: 1  Memory Usage: 24kB
        ->  Limit  (cost=0.00..1.44 rows=100 width=8) (actual time=0.011..0.025 rows=100 loops=1)
              ->  Seq Scan on test_bigint test_bigint_1  (cost=0.00..1444.18 rows=100118 width=8) (actual time=0.011..0.017 rows=100 loops=1)
  ->  Index Only Scan using test_bigint_pkey on test_bigint  (cost=0.29..7.31 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=100)
        Index Cond: (id = test_bigint_1.id)
        Heap Fetches: 100
Planning Time: 0.279 ms
Execution Time: 0.302 ms

Explain analyze  select * from test_uuid where  id in (select id from test_uuid limit 100);

QUERY PLAN
Nested Loop  (cost=3.21..783.31 rows=100 width=16) (actual time=0.080..0.474 rows=100 loops=1)
  ->  HashAggregate  (cost=2.79..3.79 rows=100 width=16) (actual time=0.046..0.066 rows=100 loops=1)
        Group Key: test_uuid_1.id
        Batches: 1  Memory Usage: 24kB
        ->  Limit  (cost=0.00..1.54 rows=100 width=16) (actual time=0.010..0.025 rows=100 loops=1)
              ->  Seq Scan on test_uuid test_uuid_1  (cost=0.00..1541.85 rows=100085 width=16) (actual time=0.009..0.016 rows=100 loops=1)
  ->  Index Only Scan using test_uuid_pkey on test_uuid  (cost=0.42..7.79 rows=1 width=16) (actual time=0.004..0.004 rows=1 loops=100)
        Index Cond: (id = test_uuid_1.id)
        Heap Fetches: 100
Planning Time: 0.180 ms
Execution Time: 0.510 ms

On Tue, 31 Jan 2023 at 03:28, Ron <ronljohnsonjr@gmail.com> wrote:

And populate that column with UUIDs generated by the gen_random_uuid() function.

(Requires v13.)


Re: Sequence vs UUID

From
Kirk Wolak
Date:


On Wed, Feb 1, 2023 at 1:34 PM veem v <veema0000@gmail.com> wrote:
I tried a quick test on dbfiddle, shown below, with uuid as the data type; in each of the cases below, UUID performance seems drastically reduced compared to sequence performance. Let me know if anything is wrong in my testing here.

1) Sequence generation vs UUID generation: execution time increased from ~291 ms to ~5655 ms.
2) Insert performance, sequence vs UUID: execution time increased from ~2031 ms to ~10599 ms.
3) Index performance, sequence vs UUID: execution time increased from ~0.3 ms to ~0.5 ms.


Yes, that is the expected result of assuming randomly generated UUIDs will be efficient as keys, versus sequences (which tend to expand in one direction, and have been heavily optimized over the years).

This article explains in detail what is going on.  If I were doing this, I would strongly consider a ULID because of getting the best of both worlds.

Of course, YMMV...  And since ULIDs are not native to PG, there is overhead, but it is far more performant, IMO...

Also, I hold out hope that one of the Gods of PostgreSQL on this list might give us an internal ULID implementation fixing that last problem...

HTH

 


Re: Sequence vs UUID

From
Miles Elam
Date:
On Wed, Feb 1, 2023 at 10:48 AM Kirk Wolak <wolakk@gmail.com> wrote:




Yes, that is the expected result of assuming randomly generated UUIDs will be efficient as keys, versus sequences (which tend to expand in one direction, and have been heavily optimized over the years).

This article explains in detail what is going on.  If I were doing this, I would strongly consider a ULID because of getting the best of both worlds.

Of course, YMMV...  And since ULIDs are not native to PG, there is overhead, but it is far more performant, IMO...

Biased comparison. ULIDs have a timestamp component. The closest UUID equivalent in Postgres is UUIDv1 from the uuid-ossp extension, not v4. Another difference not mentioned in the blog article is that UUID is versioned, meaning you can figure out what kind of data is in the UUID, whereas ULIDs are a "one size fits all" solution.

There is an implementation of sequential UUIDs for Postgres I posted earlier in this thread. In addition, here is an implementation of UUIDv7 for Postgres:

I would suggest running your tests against v1, v7, and sequential UUID before jumping on ULID, which has no native type/indexing in Postgres.

It should also be noted that apps cannot provide a bigint ID due to collisions, but an app can generate UUIDs and ULIDs without fear, essentially shifting the generation time metric in UUID/ULID's favor over a bigserial.

- Miles


Re: Sequence vs UUID

From
Benedict Holland
Date:
Well... until two processes generate an identical UUID. That happened to me several times. It's rare but when that happens, oh boy that is a mess to figure out.

Thanks,
Ben



Re: Sequence vs UUID

From
veem v
Date:
Tested the UUIDv7 generator for Postgres as below.

With regard to performance, it's still way behind the sequence. I was expecting the insert performance of UUIDv7 to be closer to the sequence, but it doesn't seem so: ~550ms vs ~3200ms. Generation also takes a lot longer than the sequence, i.e. ~59ms vs ~1700ms. Read time for the index scan looks close, i.e. ~2.4ms vs ~2.7ms.

*************** UUID7 Alternate way ***********************
create or replace function uuid_generate_v7()
returns uuid
as $$
declare
  unix_ts_ms bytea;
  uuid_bytes bytea;
begin
  unix_ts_ms = substring(int8send(floor(extract(epoch from clock_timestamp()) * 1000)::bigint) from 3);

  -- use random v4 uuid as starting point (which has the same variant we need)
  uuid_bytes = uuid_send(gen_random_uuid());

  -- overlay timestamp
  uuid_bytes = overlay(uuid_bytes placing unix_ts_ms from 1 for 6);

  -- set version 7
  uuid_bytes = set_byte(uuid_bytes, 6, (b'0111' || get_byte(uuid_bytes, 6)::bit(4))::bit(8)::int);

  return encode(uuid_bytes, 'hex')::uuid;
end
$$
language plpgsql
volatile;
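A quick sanity check of the function above (a sketch): because the leading six bytes are a millisecond timestamp, a value generated later should compare greater.

WITH t AS (SELECT uuid_generate_v7() AS u)
SELECT u < uuid_generate_v7() AS time_ordered FROM t;
-- true whenever the two calls land on different milliseconds
-- (within the same millisecond the random bytes decide the order)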


*************** PostgreSQL *****************
CREATE UNLOGGED TABLE test_bigint (id bigint PRIMARY KEY);
CREATE UNLOGGED TABLE test_uuid (id uuid PRIMARY KEY);
CREATE UNLOGGED TABLE test_uuid7 (id uuid PRIMARY KEY);
create sequence myseq cache 32767;

************* sequence generation vs UUID generation Test**************
explain analyze  select count(nextval('myseq') ) from generate_series(1,100000);

QUERY PLAN
Aggregate  (cost=1500.00..1500.01 rows=1 width=8) (actual time=59.037..59.039 rows=1 loops=1)
  ->  Function Scan on generate_series  (cost=0.00..1000.00 rows=100000 width=0) (actual time=18.541..34.200 rows=100000 loops=1)
Planning Time: 0.099 ms
Execution Time: 59.687 ms

explain analyze  select count(gen_random_uuid()) from generate_series(1,100000);

QUERY PLAN
Aggregate  (cost=1500.00..1500.01 rows=1 width=8) (actual time=900.633..900.634 rows=1 loops=1)
  ->  Function Scan on generate_series  (cost=0.00..1000.00 rows=100000 width=0) (actual time=12.893..65.820 rows=100000 loops=1)
Planning Time: 0.051 ms
Execution Time: 904.868 ms

explain analyze  select count(uuid_generate_v7()) from generate_series(1,100000);

QUERY PLAN
Aggregate  (cost=26250.00..26250.01 rows=1 width=8) (actual time=1710.609..1710.610 rows=1 loops=1)
  ->  Function Scan on generate_series  (cost=0.00..1000.00 rows=100000 width=0) (actual time=21.807..69.168 rows=100000 loops=1)
Planning Time: 0.048 ms
Execution Time: 1711.187 ms

*************** Insert Test***************

explain analyze INSERT INTO test_bigint select nextval('myseq')  from generate_series(1,100000);

QUERY PLAN
Insert on test_bigint  (cost=0.00..2250.00 rows=0 width=0) (actual time=551.707..551.708 rows=0 loops=1)
  ->  Function Scan on generate_series  (cost=0.00..1250.00 rows=100000 width=8) (actual time=12.438..91.337 rows=100000 loops=1)
Planning Time: 0.053 ms
Execution Time: 552.240 ms

explain analyze  INSERT INTO test_uuid  select gen_random_uuid()  from generate_series(1,100000);

QUERY PLAN
Insert on test_uuid  (cost=0.00..2250.00 rows=0 width=0) (actual time=2040.743..2040.744 rows=0 loops=1)
  ->  Function Scan on generate_series  (cost=0.00..1250.00 rows=100000 width=16) (actual time=12.829..982.446 rows=100000 loops=1)
Planning Time: 0.051 ms
Execution Time: 2041.242 ms

explain analyze  INSERT INTO test_uuid7  select uuid_generate_v7()  from generate_series(1,100000);

QUERY PLAN
Insert on test_uuid7  (cost=0.00..27000.00 rows=0 width=0) (actual time=3234.456..3234.457 rows=0 loops=1)
  ->  Function Scan on generate_series  (cost=0.00..26000.00 rows=100000 width=16) (actual time=12.453..1933.217 rows=100000 loops=1)
Planning Time: 0.051 ms
Execution Time: 3234.971 ms

************ Index performance********

Explain analyze select * from test_bigint where id in (select id from test_bigint limit 1000);

QUERY PLAN
Nested Loop  (cost=27.22..1322.70 rows=1000 width=8) (actual time=0.414..2.307 rows=1000 loops=1)
  ->  HashAggregate  (cost=26.92..28.92 rows=200 width=8) (actual time=0.370..0.520 rows=1000 loops=1)
        Group Key: test_bigint_1.id
        Batches: 1  Memory Usage: 145kB
        ->  Limit  (cost=0.00..14.42 rows=1000 width=8) (actual time=0.012..0.163 rows=1000 loops=1)
              ->  Seq Scan on test_bigint test_bigint_1  (cost=0.00..1444.18 rows=100118 width=8) (actual time=0.011..0.093 rows=1000 loops=1)
  ->  Index Only Scan using test_bigint_pkey on test_bigint  (cost=0.29..6.53 rows=1 width=8) (actual time=0.001..0.001 rows=1 loops=1000)
        Index Cond: (id = test_bigint_1.id)
        Heap Fetches: 1000
Planning Time: 0.395 ms
Execution Time: 2.395 ms


Explain analyze  select * from test_uuid where  id in (select id from test_uuid limit 1000);

QUERY PLAN
Nested Loop  (cost=28.32..1459.58 rows=1000 width=16) (actual time=0.431..3.355 rows=1000 loops=1)
  ->  HashAggregate  (cost=27.91..29.91 rows=200 width=16) (actual time=0.399..0.556 rows=1000 loops=1)
        Group Key: test_uuid_1.id
        Batches: 1  Memory Usage: 145kB
        ->  Limit  (cost=0.00..15.41 rows=1000 width=16) (actual time=0.011..0.185 rows=1000 loops=1)
              ->  Seq Scan on test_uuid test_uuid_1  (cost=0.00..1541.85 rows=100085 width=16) (actual time=0.010..0.093 rows=1000 loops=1)
  ->  Index Only Scan using test_uuid_pkey on test_uuid  (cost=0.42..7.21 rows=1 width=16) (actual time=0.002..0.002 rows=1 loops=1000)
        Index Cond: (id = test_uuid_1.id)
        Heap Fetches: 1000
Planning Time: 0.234 ms
Execution Time: 3.419 ms

Explain analyze  select * from test_uuid7 where  id in (select id from test_uuid7 limit 1000);

QUERY PLAN
Nested Loop  (cost=28.32..1416.01 rows=1000 width=16) (actual time=0.403..2.586 rows=1000 loops=1)
  ->  HashAggregate  (cost=27.91..29.91 rows=200 width=16) (actual time=0.371..0.546 rows=1000 loops=1)
        Group Key: test_uuid7_1.id
        Batches: 1  Memory Usage: 145kB
        ->  Limit  (cost=0.00..15.41 rows=1000 width=16) (actual time=0.011..0.161 rows=1000 loops=1)
              ->  Seq Scan on test_uuid7 test_uuid7_1  (cost=0.00..1541.85 rows=100085 width=16) (actual time=0.010..0.091 rows=1000 loops=1)
  ->  Index Only Scan using test_uuid7_pkey on test_uuid7  (cost=0.42..6.99 rows=1 width=16) (actual time=0.002..0.002 rows=1 loops=1000)
        Index Cond: (id = test_uuid7_1.id)
        Heap Fetches: 1000
Planning Time: 0.101 ms
Execution Time: 2.661 ms

On Thu, 2 Feb 2023 at 20:52, Benedict Holland <benedict.m.holland@gmail.com> wrote:
Well... until two processes generate an identical UUID. That happened to me several times. It's rare but when that happens, oh boy that is a mess to figure out.

Thanks,
Ben

On Thu, Feb 2, 2023, 10:17 AM Miles Elam <miles.elam@productops.com> wrote:
[...]


Re: Sequence vs UUID

From
"Peter J. Holzer"
Date:
On 2023-02-02 10:22:09 -0500, Benedict Holland wrote:
> Well... until two processes generate an identical UUID. That happened to me
> several times.

How did that happen? Pure software implementation with non-random seed?
Hardware with insufficient entropy source?

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"


Re: Sequence vs UUID

From
Ron
Date:
On 2/2/23 17:11, Peter J. Holzer wrote:
> On 2023-02-02 10:22:09 -0500, Benedict Holland wrote:
>> Well... until two processes generate an identical UUID. That happened to me
>> several times.
> How did that happen? Pure software implementation with non-random seed?
> Hardware with insufficient entropy source?

Poorly implemented algorithm?

-- 
Born in Arizona, moved to Babylonia.



Re: Sequence vs UUID

From
Benedict Holland
Date:
No idea at all. We had the data for the insert and had to insert it again. It was extremely confusing but oh boy did it wreck our systems.

Thanks,
Ben

On Thu, Feb 2, 2023, 6:17 PM Ron <ronljohnsonjr@gmail.com> wrote:
On 2/2/23 17:11, Peter J. Holzer wrote:
> On 2023-02-02 10:22:09 -0500, Benedict Holland wrote:
>> Well... until two processes generate an identical UUID. That happened to me
>> several times.
> How did that happen? Pure software implementation with non-random seed?
> Hardware with insufficient entropy source?

Poorly implemented algorithm?

--
Born in Arizona, moved to Babylonia.


Re: Sequence vs UUID

From
Rob Sargent
Date:

> On Feb 2, 2023, at 1:26 PM, Benedict Holland <benedict.m.holland@gmail.com> wrote:
>
> 
> No idea at all. We had the data for the insert and had to insert it again. It was extremely confusing but oh boy did it wreck our systems.
>
> Thanks,
> Ben

Someone has a baked-in uuid in a script I suspect.


>



Re: Sequence vs UUID

From
Miles Elam
Date:
On Thu, Feb 2, 2023 at 11:47 AM veem v <veema0000@gmail.com> wrote:
Tested the UUIDv7 generator for Postgres as below.

With regard to performance, it's still way behind the sequence. I was expecting the insert performance of UUIDv7 to be closer to the sequence, but it doesn't seem so: ~550ms vs ~3200ms. Generation also takes a lot longer than the sequence, i.e. ~59ms vs ~1700ms. Read time for the index scan looks close, i.e. ~2.4ms vs ~2.7ms.

Thank you for taking the time to test and measure this.

Those numbers make some intuitive sense to me. The function is written in plpgsql, not C, and depends on generating a UUIDv4 and then modifying it to include the timestamp and version change. While I suspect it will never beat a bigint, by virtue of 64 bits always being half the size of 128, the read time on the index scan after generation is encouraging, and strongly suggests there's a lot of low-hanging fruit for improvement.
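For instance, one piece of that low-hanging fruit (a sketch of mine, untested in this thread): the same byte layout can be built in a single-statement SQL function, which avoids the plpgsql interpreter on every call.

CREATE OR REPLACE FUNCTION uuid_generate_v7_sql() RETURNS uuid
LANGUAGE sql VOLATILE
AS $$
  -- same layout as the plpgsql version above: 48-bit ms timestamp,
  -- version nibble 7, remaining bits from a random v4 UUID
  SELECT encode(set_byte(u, 6, (b'0111' || get_byte(u, 6)::bit(4))::bit(8)::int), 'hex')::uuid
  FROM (SELECT overlay(uuid_send(gen_random_uuid())
                       PLACING substring(int8send(floor(extract(epoch FROM clock_timestamp()) * 1000)::bigint) FROM 3)
                       FROM 1 FOR 6) AS u) AS s
$$;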

Also, like UUIDv4, v7 can be generated by clients, ameliorating the generation bottleneck.

Once again, thank you for following up with good quality analysis.

Fwd: Sequence vs UUID

From
Dominique Devienne
Date:
Copying the list...

---------- Forwarded message ---------
From: Dominique Devienne <ddevienne@gmail.com>
Date: Fri, Feb 3, 2023 at 4:57 PM
Subject: Re: Sequence vs UUID
To: veem v <veema0000@gmail.com>

On Thu, Feb 2, 2023 at 8:47 PM veem v <veema0000@gmail.com> wrote:
Tested the UUIDv7 generator for Postgres as below.
With regard to performance, it's still way behind the sequence. [...]
explain analyze  select count(nextval('myseq') ) from generate_series(1,100000);
Execution Time: 59.687 ms

explain analyze  select count(gen_random_uuid()) from generate_series(1,100000);
Execution Time: 904.868 ms

explain analyze  select count(uuid_generate_v7()) from generate_series(1,100000);
Execution Time: 1711.187 ms

Something's off regarding Guid generations IMHO...

You generate 100K Guids in ~1s. While we generate (in C++, Windows Release, using Boost) 16M of them in +/- the same time:

Enabling Performance tests
 
generate 16'000'000 guids in 0.980s (user: 0.984s) 12 MB
generate 16'000'000 guids in parallel on 4 CPUs in 0.309s (user: 1.188s) 12 MB

That's 2 orders of magnitude faster. Sure there's some overhead from the SQL, but still. Something seems fishy.
And that's on a 2.5y old desktop. --DD
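Part of the gap may also be measurement overhead: EXPLAIN ANALYZE times every row it processes, which is not free. A less intrusive sketch using psql's \timing:

\timing on
SELECT count(gen_random_uuid()) FROM generate_series(1, 100000);
\timing off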

Re: Sequence vs UUID

From
veem v
Date:
Actually I did the testing by connecting to "https://dbfiddle.uk/", Postgres version 15.

PostgreSQL 15.0 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-10), 64-bit
Am I doing it wrong? Please confirm.

On Fri, 3 Feb 2023 at 21:28, Dominique Devienne <ddevienne@gmail.com> wrote:
[...]

Re: Sequence vs UUID

From
Dominique Devienne
Date:
On Fri, Feb 3, 2023 at 5:48 PM veem v <veema0000@gmail.com> wrote:
Actually I did the testing by connecting to "https://dbfiddle.uk/", Postgres version 15.

PostgreSQL 15.0 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-10), 64-bit
Am I doing it wrong? Please confirm.

No clue. But even SQLite can generate 1M uuids (16-byte random blobs are equivalent) in 1/3 of the time, so 30x faster.
So your timing of generating 100K uuids and counting them seems way too slow to me. --DD

sqlite> select count(randomblob(16)) from generate_series(1,1000*1000);
QUERY PLAN
`--SCAN generate_series VIRTUAL TABLE INDEX 3:
addr  opcode         p1    p2    p3    p4             p5  comment
----  -------------  ----  ----  ----  -------------  --  -------------
0     Init           0     15    0                    0   Start at 15
1     Null           0     1     1                    0   r[1..1]=NULL
2     VOpen          0     0     0     vtab:274D3E0   0
3     Integer        1     4     0                    0   r[4]=1
4     Multiply       6     6     5                    0   r[5]=r[6]*r[6]
5     Integer        3     2     0                    0   r[2]=3
6     Integer        2     3     0                    0   r[3]=2
7     VFilter        0     11    2                    0   iplan=r[2] zplan=''
8       Function       1     8     7     randomblob(1)  0   r[7]=func(r[8])
9       AggStep        0     7     1     count(1)       1   accum=r[1] step(r[7])
10    VNext          0     8     0                    0
11    AggFinal       1     1     0     count(1)       0   accum=r[1] N=1
12    Copy           1     9     0                    0   r[9]=r[1]
13    ResultRow      9     1     0                    0   output=r[9]
14    Halt           0     0     0                    0
15    Transaction    0     0     1     0              1   usesStmtJournal=0
16    Integer        1000  6     0                    0   r[6]=1000
17    Integer        16    8     0                    0   r[8]=16
18    Goto           0     1     0                    0
┌───────────────────────┐
│ count(randomblob(16)) │
├───────────────────────┤
│ 1000000               │
└───────────────────────┘
Run Time: real 0.278 user 0.250000 sys 0.000000
 
On Fri, 3 Feb 2023 at 21:28, Dominique Devienne <ddevienne@gmail.com> wrote:
Something's off regarding Guid generations IMHO...
You generate 100K Guids in ~1s. While we generate (in C++, Windows Release, using Boost) 16M of them in +/- the same time:

Re: Sequence vs UUID

From
veem v
Date:
So it may be that the machine executing the code behind the scenes at "https://dbfiddle.uk/" plays a key role in the speed; however, the comparative performance of UUID vs sequence should stay the same.
So I think, after this test, we can safely conclude that if we compare the performance of UUIDs (both version 4 and version 7) vs sequences, the UUID performs a lot worse. So unless there exists some strong reason/justification for UUIDs, we should use a sequence by default. Correct me if I'm wrong. I understand that cases like multi-master replication/sharding etc. may be a factor, but other than those I can't think of any scenario where a sequence can't be used.



On Fri, 3 Feb 2023 at 23:07, Dominique Devienne <ddevienne@gmail.com> wrote:
[...]

Re: Sequence vs UUID

From
Julian Backes
Date:
I don't really understand what you mean by 'performance'. To me it is not surprising that incrementing a 64-bit integer (I know it is not just incrementing) is faster than generating 128 bits with a good amount of random data, even if it does seem too slow. So in my opinion you need to separate
1) generating data (which might happen on the client in the UUID case, and not in the db...)
2) inserting data
3) selecting data

in both sequential as well as parallel scenarios (see the sketch below for separating 1 from 2).
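A sketch of separating 1) from 2) on the Postgres side (table names follow the earlier test case; staged_uuid is new): materialize the ids first, then time only the insert.

CREATE UNLOGGED TABLE staged_uuid AS
  SELECT gen_random_uuid() AS id FROM generate_series(1, 100000);
EXPLAIN ANALYZE INSERT INTO test_uuid SELECT id FROM staged_uuid;  -- insert/index cost only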

On Mon, Feb 6, 2023 at 19:32, veem v <veema0000@gmail.com> wrote:
So it may be that the machine executing the code behind the scenes at "https://dbfiddle.uk/" plays a key role in the speed; however, the comparative performance of UUID vs sequence should stay the same.
So I think, after this test, we can safely conclude that if we compare the performance of UUIDs (both version 4 and version 7) vs sequences, the UUID performs a lot worse. So unless there exists some strong reason/justification for UUIDs, we should use a sequence by default. Correct me if I'm wrong. I understand that cases like multi-master replication/sharding etc. may be a factor, but other than those I can't think of any scenario where a sequence can't be used.



On Fri, 3 Feb 2023 at 23:07, Dominique Devienne <ddevienne@gmail.com> wrote:
[...]

Re: Sequence vs UUID

From
"Peter J. Holzer"
Date:
On 2023-02-06 20:04:39 +0100, Julian Backes wrote:
> I don't really understand what you mean by 'performance'. To me it is not
> surprising that incrementing a 64-bit integer (I know it is not just
> incrementing) is faster than generating 128 bits with a good amount of
> random data, even if it does seem too slow.

That's not really the problem with UUIDs, though. My (not very fast)
laptop can call getrandom() 1 million times per second (in a single
thread). Avoiding the system call[1] could make this fast enough to be
completely negligible compared to the time of writing a row to disk.

But UUIDs are random and that plays havoc with locality. For example
consider one table with invoices and another with invoice items. If you
want to get all the invoices including the items of a single day, the
data is probably nicely clustered together in the tables. But the join
needs to look up random ids in the index, which will be spread all over
the index. In a simple benchmark for this scenario the UUIDs were about
4.5 times slower than sequential ids. (In other benchmarks the
difference was only a few percent)

So depending on the queries, the difference may be negligible or huge.

It really depends on your access patterns.
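As an illustration of that scenario (a hypothetical schema, not the actual benchmark):

CREATE TABLE invoice      (id bigint PRIMARY KEY, created date);
CREATE TABLE invoice_item (invoice_id bigint REFERENCES invoice(id), amount numeric);

-- One day's invoices plus their items: with sequential ids, the ids created
-- that day are adjacent, so the join's index probes hit a few hot pages;
-- with random UUID keys the same probes scatter across the whole index.
SELECT i.id, it.amount
FROM invoice i
JOIN invoice_item it ON it.invoice_id = i.id
WHERE i.created = date '2023-02-06';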

        hp


[1] There was even a discussion about making that much faster on the
    LKML recently.


--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"


Re: Sequence vs UUID

From
Merlin Moncure
Date:
On Mon, Feb 6, 2023 at 1:22 PM Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
[...]

But UUIDs are random and that plays havoc with locality. For example
consider one table with invoices and another with invoice items. If you
want to get all the invoices including the items of a single day, the
data is probably nicely clustered together in the tables. But the join
needs to look up random ids in the index, which will be spread all over
the index. In a simple benchmark for this scenario the UUIDs were about
4.5 times slower than sequential ids. (In other benchmarks the
difference was only a few percent)

This is really key.  

While many of the people posting here may understand this, all of the databases I've seen that are written with the UUID pattern appear to be written by developers oblivious to this fact.  The UUID pattern seems to be popular with developers who want to abstract the database away underneath the code, who might use an ORM, and who tend to be weaker on database-facing constraint checking.  My direct observation is that these databases scale poorly and the developers spend a lot of time building tools that fix broken data stemming from application bugs.

I'm certain this is not the experience of everyone here.  I do however find the counter-sequence arguments to be somewhat silly; partition-safe sequence generation is simple to solve using simple methods. The "ID guessing" concern is similarly overstated; if your application relies on id obfuscation to be secure, you might have much bigger issues to contend with IMO.

merlin
 

Re: Sequence vs UUID

From
Dominique Devienne
Date:
On Tue, Feb 7, 2023 at 3:47 PM Merlin Moncure <mmoncure@gmail.com> wrote:
On Mon, Feb 6, 2023 at 1:22 PM Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
But UUIDs are random and that plays havoc with locality.

This is really key.  [...] the databases I've seen that are written with the
UUID pattern appear to be written by developers oblivious to this fact.

Well, perhaps these developers are not dealing with temporally clustered data, like a commerce-related DB,
but with more scientific data? In any case, this developer will definitely investigate ULIDs vs UUIDs vs ints (sequences),
based on the info from this thread. I'm aware of fragmentation issues and cache hit/miss issues etc. in general,
but was probably not sufficiently making the mental connection with UUIDs and PostgreSQL. So thanks everyone.

Re: Sequence vs UUID

From
veem v
Date:
Thank you So much all for such valuable feedback. 

As "Julian" was pointing, I also tried to test the INSERT independently(as in below test case) without keeping the "generate_series" in the inline query. But in all the cases sequence is performing better as compared to both UUID V4 and UUID V7. And same with Index access path observed i.e. an index on sequence performs better as compared to an index on UUID column. So i believe , its understood that the performance wise sequence is better as compared to even sequential UUID (like UUID-V7). And it seems that, even the UUID V7 maintains better locality because of its sequential nature, but still it all may tied to the higher space/memory consumption of UUID because of its bigger size as compared to sequence and thus the performance hit.

But as I saw a few comments in this discussion regarding scenarios which mandate the usage of UUIDs, like "multi-master replication", "sharding", "need to be able to move data between databases", etc., I wanted to understand why we can't use sequences as the PK in those scenarios. Say, for example, in the case of multi-master replication we can use a sequence as below (see the DDL sketch after the example numbers).

So I wanted to know from the experts here: does there really exist any scenario in which a UUID truly can't be avoided?

Sequence Number = n*d + m + offset, where n is the sequence order number, d is the number of nodes in the multi-master replication, m (ranging from 0 to d-1) is the number assigned to each node in the replication, and offset is the number by which to offset the sequence numbers.

For a 4-way multi-master replication, d=4, m is in (0, 1, 2, 3), and the offset is 100.
Node #1 (m=0) :Sequence number = n*4+100
Node #2 (m=1): Sequence number = n*4+101
Node #3 (m=2): Sequence number = n*4+102
Node #4 (m=3): Sequence number = n*4+103

Each sequence will have: 
100, 104, 108,112, 116, 120,...
101, 105, 109, 113, 117, 121,...
102, 106, 110, 114, 118, 122...
103, 107, 111, 115, 119, 123
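In Postgres this maps directly onto native sequence options (a sketch; the sequence name is illustrative, and the numbers follow the example above):

-- d = 4 nodes, offset = 100; node m uses START WITH m + offset, INCREMENT BY d
CREATE SEQUENCE order_id_seq START WITH 100 INCREMENT BY 4;  -- node #1 (m=0): 100, 104, 108, ...
-- node #2 (m=1) would use START WITH 101, node #3 (m=2) START WITH 102, etc.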

***************** Test case *************************

CREATE UNLOGGED TABLE test_bigint ( id bigint PRIMARY KEY);
CREATE UNLOGGED TABLE test_uuid ( id uuid PRIMARY KEY);
CREATE UNLOGGED TABLE test_uuid7 ( id uuid PRIMARY KEY);
CREATE UNLOGGED TABLE test_bigint_1 ( id bigint PRIMARY KEY);
CREATE UNLOGGED TABLE test_uuid_1 ( id uuid PRIMARY KEY);
CREATE UNLOGGED TABLE test_uuid7_1 ( id uuid PRIMARY KEY);
create sequence myseq cache 32767;

*************** Insert Test***************
INSERT INTO test_bigint select nextval('myseq') from generate_series(1,100000);
INSERT INTO test_uuid select gen_random_uuid() from generate_series(1,100000);
INSERT INTO test_uuid7 select uuid_generate_v7() from generate_series(1,100000);


explain analyze INSERT INTO test_bigint_1 select id from test_bigint;

QUERY PLAN
Insert on test_bigint_1 (cost=0.00..1444.18 rows=0 width=0) (actual time=220.689..220.690 rows=0 loops=1)
  ->  Seq Scan on test_bigint (cost=0.00..1444.18 rows=100118 width=8) (actual time=0.012..17.488 rows=100000 loops=1)
Planning Time: 0.137 ms
Execution Time: 220.714 ms

explain analyze INSERT INTO test_uuid_1 select id from test_uuid;

QUERY PLAN
Insert on test_uuid_1 (cost=0.00..1541.85 rows=0 width=0) (actual time=311.949..311.950 rows=0 loops=1)
  ->  Seq Scan on test_uuid (cost=0.00..1541.85 rows=100085 width=16) (actual time=0.010..20.173 rows=100000 loops=1)
Planning Time: 0.082 ms
Execution Time: 311.973 ms


explain analyze INSERT INTO test_uuid7_1 select id from test_uuid7;

QUERY PLAN
Insert on test_uuid7_1 (cost=0.00..1541.85 rows=0 width=0) (actual time=244.531..244.532 rows=0 loops=1)
  ->  Seq Scan on test_uuid7 (cost=0.00..1541.85 rows=100085 width=16) (actual time=0.011..16.390 rows=100000 loops=1)
Planning Time: 0.084 ms
Execution Time: 244.554 ms

Explain analyze select * from test_bigint where id in (select id from test_bigint limit 100000);

QUERY PLAN
Nested Loop (cost=2692.77..3793.23 rows=50059 width=8) (actual time=53.739..368.110 rows=100000 loops=1)
  ->  HashAggregate (cost=2692.48..2694.48 rows=200 width=8) (actual time=53.681..93.044 rows=100000 loops=1)
        Group Key: test_bigint_1.id
        Batches: 5  Memory Usage: 11073kB  Disk Usage: 208kB
        ->  Limit (cost=0.00..1442.48 rows=100000 width=8) (actual time=0.020..18.985 rows=100000 loops=1)
              ->  Seq Scan on test_bigint test_bigint_1 (cost=0.00..1444.18 rows=100118 width=8) (actual time=0.019..11.330 rows=100000 loops=1)
  ->  Index Only Scan using test_bigint_pkey on test_bigint (cost=0.29..6.53 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=100000)
        Index Cond: (id = test_bigint_1.id)
        Heap Fetches: 100000
Planning Time: 0.373 ms
Execution Time: 373.440 ms

Explain analyze select * from test_uuid where id in (select id from test_uuid limit 100000);

QUERY PLAN
Nested Loop (cost=2790.96..4006.29 rows=50042 width=16) (actual time=48.251..410.786 rows=100000 loops=1)
  ->  HashAggregate (cost=2790.54..2792.54 rows=200 width=16) (actual time=48.157..76.176 rows=100000 loops=1)
        Group Key: test_uuid_1.id
        Batches: 1  Memory Usage: 7969kB
        ->  Limit (cost=0.00..1540.54 rows=100000 width=16) (actual time=0.018..19.217 rows=100000 loops=1)
              ->  Seq Scan on test_uuid test_uuid_1 (cost=0.00..1541.85 rows=100085 width=16) (actual time=0.017..11.391 rows=100000 loops=1)
  ->  Index Only Scan using test_uuid_pkey on test_uuid (cost=0.42..7.20 rows=1 width=16) (actual time=0.003..0.003 rows=1 loops=100000)
        Index Cond: (id = test_uuid_1.id)
        Heap Fetches: 100000
Planning Time: 0.378 ms
Execution Time: 415.547 ms

Explain analyze select * from test_uuid7 where id in (select id from test_uuid7 limit 100000);

QUERY PLAN
Nested Loop (cost=2790.96..3972.96 rows=50042 width=16) (actual time=45.919..401.184 rows=100000 loops=1)
  ->  HashAggregate (cost=2790.54..2792.54 rows=200 width=16) (actual time=45.806..80.545 rows=100000 loops=1)
        Group Key: test_uuid7_1.id
        Batches: 1  Memory Usage: 7969kB
        ->  Limit (cost=0.00..1540.54 rows=100000 width=16) (actual time=0.020..18.573 rows=100000 loops=1)
              ->  Seq Scan on test_uuid7 test_uuid7_1 (cost=0.00..1541.85 rows=100085 width=16) (actual time=0.018..10.920 rows=100000 loops=1)
  ->  Index Only Scan using test_uuid7_pkey on test_uuid7 (cost=0.42..6.99 rows=1 width=16) (actual time=0.003..0.003 rows=1 loops=100000)
        Index Cond: (id = test_uuid7_1.id)
        Heap Fetches: 100000
Planning Time: 0.194 ms
Execution Time: 405.931 ms

On Tue, 7 Feb, 2023, 9:38 pm Dominique Devienne, <ddevienne@gmail.com> wrote:
[...]

Re: Sequence vs UUID

From
"Peter J. Holzer"
Date:
On 2023-02-08 14:48:03 +0530, veem v wrote:
> So wanted to know from experts here, is there really exists any scenario in
> which UUID really cant be avoided?

Probably not. The question is usually not "is this possible" but "does
this meet the requirements at acceptable cost".


> Sequence Number = n*d+m+offset. Where n is the sequence order number, d is the
> dimensions of the multi-master replication, m ranges from 0 to n-1 is the
> number assigned to each node in the replication, and offset is the number to
> offset the sequence numbers. 

Yes, you can do this. In fact, people (including me) have already done
this.

But it's relatively easy to mess this up:

Firstly, you have to make sure that d is larger than your number of
(active) replicas will ever be, but still small enough that you will
never overflow. Probably not a problem with 64 bit sequences (if you set
d to 1E6, you can still count to 9E12 on each node), but might be a
problem if you are for some reason limited to 32 bits.

Secondly (and IMHO more importantly) you have to make sure each node
gets its own unique offset. So this needs to be ensured during
deployment, but also during migrations, restores from backups and other
infrequent events.

With random Ids you don't have to worry about this.

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"


Re: Sequence vs UUID

From
Kirk Wolak
Date:
On Wed, Feb 8, 2023 at 4:18 AM veem v <veema0000@gmail.com> wrote:
Thank you So much all for such valuable feedback. 
..
So wanted to know from experts here, is there really exists any scenario in which UUID really cant be avoided?

Funny you are asking about this.  My recent experience is that UUIDs really get crushed on performance in medium (> 5 million rows) tables.
I found an article by Dave Allie on ULID, and I modified his implementation to create a timestamp(6) (microsecond level) sequenced version.

Doing an article on this soon.  But WITHOUT calling "gen_random_bytes" I can generate 2 timestamps at the same microsecond level.
Once that call is included in the function, I've never come close to returning 2 timestamps at the same microsecond, although I did not
run this on multiple threads.  This fits our needs for an efficient UUID-formatted key...

9 Bytes (18 Hex Digits) of Randomness at the far right.

Oh, and some time after the year 4000 the 56-bit microsecond timestamp will wrap around... But I expect 256-bit UUIDs will take over before then.


CREATE FUNCTION generate_ulid() RETURNS uuid
LANGUAGE sql
RETURN ((lpad(to_hex((floor((EXTRACT(epoch FROM clock_timestamp()) * (1000000)::numeric)))::bigint), 14, '0'::text)
      || encode(gen_random_bytes(9), 'hex'::text)))::uuid;
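Worth noting for anyone copying this: gen_random_bytes() comes from the pgcrypto extension, so a usage sketch looks like:

CREATE EXTENSION IF NOT EXISTS pgcrypto;
SELECT generate_ulid();  -- first 14 hex digits: microseconds since the epoch; last 18: random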

 

Re: Sequence vs UUID

From
Miles Elam
Date:
On Wed, Feb 8, 2023 at 11:56 AM Kirk Wolak <wolakk@gmail.com> wrote:
>
> CREATE FUNCTION generate_ulid() RETURNS uuid
>     LANGUAGE sql
> RETURN ((lpad(to_hex((floor((EXTRACT(epoch FROM clock_timestamp()) * (1000000)::numeric)))::bigint), 14, '0'::text)
>       || encode(gen_random_bytes(9), 'hex'::text)))::uuid;

You can save yourself some CPU by skipping the extra cast, omitting
the lpad, to_hex, and floor, and just grabbing the bytes from the
bigint directly along with the random part since bytea can be
concatenated.

    SELECT encode(int8send((EXTRACT(epoch FROM clock_timestamp()) * 1000000)::bigint)
                  || gen_random_bytes(8), 'hex')::uuid;

Note that you refer to it as a ULID, but it is stored as a UUID.
Hopefully nothing downstream ever relies on UUID versioning/spec
compliance. Now that I think of it, I could probably speed up my
tagged interval UUID implementation using some of this at the expense
of configurability.
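And if you ever need the embedded timestamp back out of such a value, a sketch using the usual hex-to-bit idiom (illustrative only):

SELECT to_timestamp(
         (('x' || substring(encode(uuid_send(u), 'hex') FROM 1 FOR 16))::bit(64)::bigint)
         / 1000000.0) AS embedded_time
FROM (SELECT encode(int8send((EXTRACT(epoch FROM clock_timestamp()) * 1000000)::bigint)
             || gen_random_bytes(8), 'hex')::uuid AS u) AS s;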



Re: Sequence vs UUID

From
Merlin Moncure
Date:
On Wed, Feb 8, 2023 at 5:33 AM Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
[...]
Firstly, you have to make sure that d is larger than your number of
(active) replicas will ever be, but still small enough that you will
never overflow. Probably not a problem with 64 bit sequences (if you set
d to 1E6, you can still count to 9E12 on each node), but might be a
problem if you are for some reason limited to 32 bits.

Secondly (and IMHO more importantly) you have to make sure each node
gets its own unique offset. So this needs to be ensured during
deployment, but also during migrations, restores from backups and other
infrequent events.

??  All you have to do is ensure each node has its own unique id, and that that id is involved in sequence generation.  This has to be done for other reasons anyway, and is a zero-effort/low-risk process.

The id would then contain the identifier of the node that generated the id, rather than the node that contains the id.  This is exactly analogous to strategies that use the MAC address as part of an id prefix, for example.  Once generated, it's known unique and you don't have to consider anything else.  This is exactly what I do, and there is no interaction with backups, deployments, migrations, etc.  Node expansion does require that each node get a unique node id, and that's it.
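A minimal sketch of that approach (names and the exact bit split are illustrative, not from this thread): reserve a few high bits of the bigint for the node id and leave the rest to a local sequence.

CREATE SEQUENCE local_seq;
CREATE FUNCTION next_global_id(node_id int) RETURNS bigint
LANGUAGE sql AS $$
  -- 10 bits of node id (up to 1024 nodes), 53 bits of per-node sequence
  SELECT (node_id::bigint << 53) | nextval('local_seq')
$$;
-- e.g. SELECT next_global_id(7);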

merlin