Thread: Sequence vs UUID
Hello. We are trying to understand whether we should use UUIDs or sequences for primary keys in general. In many blogs (one is linked below) covering multiple databases, the advice is mostly that sequences perform better than UUIDs. But I think that in concurrent data-load scenarios a UUID will spread the contention point, whereas a sequence can be a single point of contention.

So we want to understand from the experts here: are there any clear rules, or a pros-and-cons list for each, to identify the exact scenarios in which we should choose one over the other? Basically, I wanted to see if we can run some tests on sample data to measure the percentage overhead on read and write performance with UUID vs sequence keys, and draw a general conclusion. And considering that open-source Postgres is the base for many databases (Redshift etc.), results that apply to Postgres should apply to those as well.
https://www.percona.com/blog/2019/11/22/uuids-are-popular-but-bad-for-performance-lets-discuss/
> On Jan 26, 2023, at 11:17, veem v <veema0000@gmail.com> wrote:
> So we want to understand from experts here, if there are any clear rules available or if we have any pros vs cons list available for each of those to understand the exact scenario in which we should go for one over other?

Clear rules are a bit difficult to come by here, but there are general guidelines. First, the distinction isn't exactly UUIDs vs sequences. There are two distinctions:

1. UUIDs vs bigints.
2. Sequential values vs random values.

The advantage of bigints vs UUIDs is that bigints will be faster for PostgreSQL to process in a variety of ways, as well as being half the size of a UUID (so: smaller tables, smaller indexes, etc.). The main advantage of UUIDs vs bigints is, if you are using random values, there's existing easy-to-use infrastructure for generating UUIDs, versus generating random bigints.

The advantage of sequential values is that they interact much better with caching than random values. Random values will have a harder time maintaining a reasonable in-memory working set than sequential values. (Of course, if the database fits entirely in memory, this isn't as much of an issue.) The advantage of random values is that they are more opaque: if there is a 123480102 in a sequential key, an attacker can be confident there's also a 123480103, which a random value avoids. There are algorithms for generating sequential values that avoid this by having hard-to-guess less significant digits.

Another advantage of sequential values is that they are (roughly) time-ordered, so they can be used to get "most recent" efficiently.

One concern about sequential values that generally is not a real issue is the bottleneck of creating new sequential values. The sequence functionality in PostgreSQL is very concurrency-friendly.

UUIDs can be generated in such a way that they have sequential properties; see: https://github.com/tvondra/sequential-uuids
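[Editor's note: to make the "infrastructure" point concrete — gen_random_uuid() is built in (PostgreSQL 13+), while a random bigint has to be improvised. A minimal sketch, assuming the pgcrypto extension is available; the cast chain below is a common idiom, not a dedicated API:

-- built-in random UUID (PostgreSQL 13+)
SELECT gen_random_uuid();

-- improvised random bigint: 8 random bytes -> hex -> bit(64) -> bigint
-- (requires pgcrypto; result may be negative, and uniqueness is not guaranteed)
CREATE EXTENSION IF NOT EXISTS pgcrypto;
SELECT ('x' || encode(gen_random_bytes(8), 'hex'))::bit(64)::bigint AS random_id;
]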
On Thu, Jan 26, 2023 at 1:18 PM veem v <veema0000@gmail.com> wrote:
> We were trying to understand whether we should use UUID or Sequence in general for primary keys. [...]

* UUIDs are big for an identifier (16 bytes); a sequence is 4 or 8.
* Sequences are NOT a contention point; the database uses some tricks to work around that.
* UUIDs are basically random data, causing page fragmentation. This is particularly bad in auto-clustering architectures like SQL Server.
* Also, UUIDs can deliver very poor buffer hit ratios when sweeping ranges of records on large tables.

merlin
"Sweeping ranges of records" by ID suggests you have information in your id. If you're comfortable with sequence generation as a surrogate for time-point of entry, by all means have at it.On Thu, Jan 26, 2023 at 1:18 PM veem v <veema0000@gmail.com> wrote:Hello, We were trying to understand whether we should use UUID or Sequence in general for primary keys. In many of the blogs(one is below) across multiple databases, I saw over the internet and all are mostly stating the sequence is better as compared to UUID. But I think in the case of concurrent data load scenarios UUID will spread the contention point whereas sequence can be a single point of contention.
So we want to understand from experts here, if there are any clear rules available or if we have any pros vs cons list available for each of those to understand the exact scenario in which we should go for one over other? Basically I wanted to see if we can perform some test on sample data to see the percentage of overhead on read and write performances of the query in presence of UUID VS Sequence to draw some conclusion in general? And also considering open source postgres as the base for many databases like redshift etc, so the results which apply to progress would apply to others as well.
https://www.percona.com/blog/2019/11/22/uuids-are-popular-but-bad-for-performance-lets-discuss/
* UUIDs are big for identifer (16 bytes) sequence is 4 or 8* sequences are NOT a contention point, database uses some tricks to work around that
* UUIDS are basically random data causing page fragmentation. this is particularly bad in auto clustering architectures like sql server* Also, UUIDS can deliver very poor buffer hit ratios when sweeping ranges of records on large tables.merlin
For many of us, contemporaneously generated records have nothing to do with each other. (And they carry attributes which groups them.)
UUIDs do a very good job of avoiding id-overlap across domains (compare to a sequence for each domain).
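[Editor's note: a small illustration of that overlap point; the table names here are invented for the example:

-- one sequence per domain: the same numeric id shows up in every domain
CREATE TABLE customers (id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY, name text);
CREATE TABLE vendors   (id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY, name text);
-- both start at id = 1, so a bare "id 1" is ambiguous across domains

-- a uuid key is unique across domains without any coordination
CREATE TABLE parties (id uuid PRIMARY KEY DEFAULT gen_random_uuid(), name text);
]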
> On 26/01/2023 20:17 CET veem v <veema0000@gmail.com> wrote:
>
> Hello, We were trying to understand whether we should use UUID or Sequence in
> general for primary keys. In many of the blogs (one is below) across multiple
> databases, I saw over the internet and all are mostly stating the sequence is
> better as compared to UUID. But I think in the case of concurrent data load
> scenarios UUID will spread the contention point whereas sequence can be a
> single point of contention.
>
> So we want to understand from experts here, if there are any clear rules
> available or if we have any pros vs cons list available for each of those to
> understand the exact scenario in which we should go for one over other?
> Basically I wanted to see if we can perform some test on sample data to see
> the percentage of overhead on read and write performances of the query in
> presence of UUID VS Sequence to draw some conclusion in general? And also
> considering open source postgres as the base for many databases like redshift
> etc, so the results which apply to progress would apply to others as well.
>
> https://www.percona.com/blog/2019/11/22/uuids-are-popular-but-bad-for-performance-lets-discuss/
I think that[1] provides a good summary. Performance consideration is just one
aspect. Is there a technical requirement for using UUID over sequential values?
If there's a single generator of primary keys use bigint sequences. In case of
multiple generators (multi-master replication, sharding, clients generating IDs)
consider UUID.
There are arguments against sequential PK, e.g. they give away too much info and
allow attacks such as forced browsing[2]. The first I can understand: you may
not want to reveal the number of users or customers. But access control should
prevent forced browsing.
[1] https://www.cybertec-postgresql.com/en/uuid-serial-or-identity-columns-for-postgresql-auto-generated-primary-keys/
[2] https://owasp.org/www-community/attacks/Forced_browsing
--
Erik
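[Editor's note: a minimal sketch of the single-generator case, assuming a plain identity column (backed by a sequence) as the bigint key:

CREATE TABLE account (
    id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- sequence-backed
    name text NOT NULL
);

INSERT INTO account (name) VALUES ('alice') RETURNING id;  -- id handed out by the database
]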
On 1/26/23 15:55, Erik Wienhold wrote:
> There are arguments against sequential PK, e.g. they give away too much info and
> allow attacks such as forced browsing[2]. The first I can understand: you may
> not want to reveal the number of users or customers. But access control should
> prevent forced browsing.

Shouldn't your application layer isolate the users from the database? UUIDs are all over the DBs I manage, but the PKs are all sequences.

--
Born in Arizona, moved to Babylonia.
Hello. I have been using UUIDs for quite a long time now. The reason I began to use UUIDs was the need to be able to move data between databases and the need to create records outside the database. You should use a UUID as the primary key for a record and also have some bookkeeping UUIDs in the record, like the "origin" of the record. In this way it is "easy" to handle different sources of data. We have also written some nice replicating software on this basis (postsync) that can check for alterations in one database and update others. In this way we can keep one or many replicas of databases.

So forget about performance issues (there will ALWAYS be a need for faster systems). The ease and functionality with UUIDs is so much better. Sequence keys are a terrible idea!

// GH

On 2023-01-26 at 20:17, veem v wrote:
> Hello, We were trying to understand whether we should use UUID or Sequence in general for primary keys. [...]

--
gorhas@raditex.nu
http://www.raditex.nu
Raditex Control AB
Bo 229, 715 91 ODENSBACKEN
Mob: 070-5530148
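[Editor's note: a rough sketch of the bookkeeping layout described above; the table and column names are invented for illustration, and postsync itself is not shown:

CREATE TABLE item (
    id      uuid PRIMARY KEY,  -- identity of the record, stable across databases
    origin  uuid NOT NULL,     -- which database/site first created the record
    payload text
);
]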
> So forget about performance issues (there will ALWAYS be a need for faster systems). The ease and functionality with UUIDs
> is so much better. Sequence keys are a terrible idea!
>
> // GH

Wow. I am not alone.
> On 27/01/2023 01:48 CET Ron <ronljohnsonjr@gmail.com> wrote:
>
> On 1/26/23 15:55, Erik Wienhold wrote:
> >
> > There are arguments against sequential PK, e.g. they give away too much info and
> > allow attacks such as forced browsing[2]. The first I can understand: you may
> > not want to reveal the number of users or customers. But access control should
> > prevent forced browsing.
>
> Shouldn't your application layer isolate the users from the database? UUIDs
> are all over the DBs I manage, but the PKs are all sequences.
Yes, I meant the application layer, not Postgres' access control.
--
Erik
Type 4 UUIDs are sub-optimal for big tables because cache hit rates drop through the floor.

This lesson was burned into my psyche waaaay back in the Clinton administration. It was my task to speed up a five hour batch job which read input records from a flat file, did some validations and then inserted them. Simply sorting the input file on the primary key fields -- we used natural keys, not synthetics -- dropped the run time to two hours. (VMS SORT saved the day, because you could tell it the sort order you wanted; thus, I could preserve the header record at the top of the file, and the trailer record at the end of the file without jumping through a bunch of hoops.)

--
Born in Arizona, moved to Babylonia.
Why is it a terrible idea? I have been using them for years without a single problem. I don't rely on them for create order. Terrible seems a bit extreme.

Thanks,
Ben

On Sat, Jan 28, 2023, 3:39 PM Erik Wienhold <ewie@ewie.name> wrote:
> Yes, I meant the application layer, not Postgres' access control.
On Sat, Jan 28, 2023 at 6:02 PM Ron <ronljohnsonjr@gmail.com> wrote:
> Type 4 UUIDs are sub-optimal for big table because cache hit rates drop through the floor.
>
> This lesson was burned into my psyche waaaay back in the Clinton administration. [...]

This can be mitigated with judicious use of a sequence at the front of the uuidv4.

https://www.2ndquadrant.com/en/blog/sequential-uuid-generators/

More effort than just calling the built-in gen_random_uuid() or equivalent in app code, but a substantial performance gain for your effort.

https://github.com/tvondra/sequential-uuids

And in a managed environment where you can't install custom extensions, a fairly simple function with a divide on unix epoch seconds combined with a call to overlay(...) should suffice performance-wise. At 60 seconds, this will loop every 45 days or so, and you can choose how much "sequentialness" works for you, from 1 to 4 bytes, at the expense of pseudo-randomness.

-----------------------------

-- Generate time interval UUID
CREATE OR REPLACE FUNCTION gen_interval_uuid(interval_seconds int4 = 60, block_num_bytes int2 = 2)
RETURNS uuid LANGUAGE sql VOLATILE PARALLEL SAFE AS $$
SELECT encode(
  overlay(
    -- convert the uuid to byte array
    uuid_send(gen_random_uuid())
    -- use only the bottom bytes
    PLACING substring(
      int4send((extract(epoch FROM now()) / interval_seconds)::int4)
      FROM (5 - block_num_bytes)
    )
    -- place at the front two bytes of the uuid
    FROM 1
  )
  -- convert the resulting byte array to hex for conversion to uuid
  , 'hex')::uuid
WHERE interval_seconds > 0 AND block_num_bytes BETWEEN 1 AND 4
$$;

-----------------------------

Technically this deviates from the v4 spec. You could always convert to a UUIDv8, which is the intention behind that new version, even though the standard hasn't been ratified yet.

Cheers,

Miles Elam
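[Editor's note: a hypothetical usage sketch for the function above, with the default 60-second buckets and 2 prefix bytes:

CREATE TABLE session (
    id uuid PRIMARY KEY DEFAULT gen_interval_uuid()
);

SELECT gen_interval_uuid();  -- ad-hoc generation; ids created within the same bucket share a prefix
]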
Then it's not a Type 4 UUID, which is perfectly fine; just not random.

Also, should now() be replaced by clock_timestamp(), so that it can be called multiple times in the same transaction?

On 1/28/23 21:28, Miles Elam wrote:
> This can be mitigated with judicious use of a sequence at the front of
> the uuidv4.
>
> [function definition snipped]

--
Born in Arizona, moved to Babylonia.
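[Editor's note: for reference, now() is frozen at the start of the transaction while clock_timestamp() keeps advancing, which is easy to check:

BEGIN;
SELECT now(), clock_timestamp();  -- close together at first
SELECT pg_sleep(1);
SELECT now(), clock_timestamp();  -- now() unchanged; clock_timestamp() ~1s later
COMMIT;
]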
On Sat, Jan 28, 2023 at 8:02 PM Ron <ronljohnsonjr@gmail.com> wrote:
>
> Then it's not a Type 4 UUID, which is perfectly fine; just not random.
Yep, which is why it really should be re-versioned to UUIDv8 to be
pedantic. In everyday use though, almost certainly doesn't matter.
> Also, should now() be replaced by clock_timestamp(), so that it can be
> called multiple times in the same transaction?
Not necessary. Instead of 122 bits of entropy, you get 106 bits of
entropy and a new incremented prefix every minute. now() vs
clock_timestamp() wouldn't make a substantive difference. Should still
be reasonably safe against the birthday paradox for more than a
century when creating more than a million UUIDs per second.
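[Editor's note: a back-of-the-envelope check of that claim (my own arithmetic, not from the thread). With ~106 random bits per one-minute prefix window and 10^6 UUIDs per second, the birthday bound gives roughly n^2 / 2^107 per window:

-- n = 6e7 ids per one-minute window
SELECT (6e7::numeric ^ 2) / (2::numeric ^ 107) AS p_collision_per_minute;
-- ~2e-17 per minute; over a century (~5.3e7 minutes) that is still only ~1e-9
]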
I have a question. As I understand it, usage-wise there are multiple benefits of UUIDs over sequences: for example, in a distributed app where we may not be able to rely on one point generator like a sequence, or in a multi-master architecture, or with sharding.

If we look at it purely performance-wise, the key advantage of a sequence is its smaller storage size: the index will cache more rows per block, which benefits read queries and should also help joins. Also, fetching a value from a sequence is cheaper than calculating a UUID. The downside is during write operations: it can be a point of contention under concurrent data load, as every incoming request will try to modify the same table/index page/block.

But as mentioned in this blog (https://www.2ndquadrant.com/en/blog/sequential-uuid-generators/), a UUID can be made sequential using a timestamp prefix etc. However, isn't making the UUID sequential actually a disadvantage again? Won't it become a contention point for the unique index, as each incoming write now fights for the same block/page during concurrent data load, contending for the same table block or one side of the index branch/leaf blocks, whereas with random UUIDs the writes were spread across multiple blocks, so there was no contention on any specific block? Please correct me if my understanding is wrong.

On Sun, 29 Jan, 2023, 10:33 am Miles Elam, <miles.elam@productops.com> wrote:
> Not necessary. Instead of 122 bits of entropy, you get 106 bits of
> entropy and a new incremented prefix every minute.
On Mon, Jan 30, 2023 at 5:11 PM veem v <veema0000@gmail.com> wrote:
> CREATE TABLE test1_UUID ( id bigint, source_id varchar(36) PRIMARY KEY, Name varchar(20) );

Maybe if you used a "native" 16-byte uuid, instead of its textual serialization with dashes (36 bytes + length overhead), the gap would narrow.
Dominique Devienne <ddevienne@gmail.com> writes:
> On Mon, Jan 30, 2023 at 5:11 PM veem v <veema0000@gmail.com> wrote:
>> CREATE TABLE test1_UUID ( id bigint,source_id varchar(36) PRIMARY KEY, Name varchar(20) );
> Maybe if you used a "native" 16-byte uuid, instead of its textual
> serialization with dashes (36 bytes + length overhead), the gap would
> narrow.
Yeah, especially if your database is not using C locale. The
strcoll or ICU-based comparisons done on string types can be
enormously more expensive than the memcmp() used for binary
types like native uuid.
regards, tom lane
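[Editor's note: if a text key can't be avoided, one mitigation along the lines Tom describes is to pin the column to the C collation so comparisons are plain byte comparisons; a hedged sketch (the native uuid type remains the better fix):

CREATE TABLE test1_uuid_text (
    id        bigint,
    source_id varchar(36) COLLATE "C" PRIMARY KEY,  -- byte-wise comparisons, no strcoll/ICU
    name      varchar(20)
);
]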
On 1/30/23 11:43, veem v wrote:
> Thank You So much for the details. I am a bit new to postgres. And these
> test results I picked were from a dev system. If I understand it
> correctly, do you mean these settings (usage of C locale or "native"
> 16-byte uuid) which you mentioned should be there in a production system
> and thus we should test the performance of the UUID vs sequence on a
> similar setup? Or say if this sort of degradation of UUID performance is
> not expected then, how to get these settings tweaked ON, so as to see
> the best string type or UUID performance, can you please guide me here?

No, what is being said is change:

source_id varchar(36)

to

source_id uuid

as in: https://www.postgresql.org/docs/current/datatype-uuid.html

--
Adrian Klaver
adrian.klaver@aklaver.com
And populate that column with UUIDs generated by the gen_random_uuid() function.
(Requires v13.)
On 1/30/23 13:46, Adrian Klaver wrote:
> On 1/30/23 11:43, veem v wrote:
>> Thank You So much for the details. I am a bit new to postgres. And these
>> test results I picked were from a dev system. If I understand it
>> correctly, do you mean these settings(usage of C locale or "native"
>> 16-byte uuid) which you mentioned should be there in a production system
>> and thus we should test the performance of the UUID vs sequence on a
>> similar setup? Or say if this sort of degradation of UUID performance is
>> not expected then , how to get these settings tweaked ON, so as to see
>> the best string type or UUID performance, can you please guide me here?
>
> No what is being said is change:
>
> source_id varchar(36)
>
> to
>
> source_id uuid
>
> as i:
>
> https://www.postgresql.org/docs/current/datatype-uuid.html
--
Born in Arizona, moved to Babylonia.
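[Editor's note: putting those two suggestions together, a minimal sketch of the revised table:

CREATE TABLE test1_uuid (
    id        bigint,
    source_id uuid PRIMARY KEY DEFAULT gen_random_uuid(),  -- native 16-byte type; default needs v13+
    name      varchar(20)
);
]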
I tried to test quickly below on dbfiddle with the UUID as data type, and in each of the below cases the UUID performance seems drastically reduced as compared to sequence performance. Let me know if anything is wrong in my testing here?

1) Sequence generation vs UUID generation: execution time increased from ~291ms to ~5655ms.
2) Insert performance of "sequence" vs "UUID": execution time increased from ~2031ms to ~10599ms.
3) Index performance for sequence vs UUID: execution time increased from ~0.3ms to ~0.5ms.

PostgreSQL 15.0 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-10), 64-bit

CREATE UNLOGGED TABLE test_bigint ( id bigint PRIMARY KEY);
CREATE UNLOGGED TABLE test_uuid ( id uuid PRIMARY KEY);
create sequence myseq cache 32767;
************ sequence generation vs UUID generation Test**************
explain analyze select count(nextval('myseq') ) from generate_series(1,100000);
QUERY PLAN
Aggregate (cost=1500.00..1500.01 rows=1 width=8) (actual time=291.030..291.030 rows=1 loops=1)
-> Function Scan on generate_series (cost=0.00..1000.00 rows=100000 width=0) (actual time=53.332..63.941 rows=100000 loops=1)
Planning Time: 0.155 ms
Execution Time: 291.719 ms
explain analyze select count(gen_random_uuid()) from generate_series(1,100000);
QUERY PLAN
Aggregate (cost=1500.00..1500.01 rows=1 width=8) (actual time=5654.453..5654.454 rows=1 loops=1)
-> Function Scan on generate_series (cost=0.00..1000.00 rows=100000 width=0) (actual time=84.328..514.214 rows=100000 loops=1)
Planning Time: 0.082 ms
Execution Time: 5655.158 ms
*************** Insert Test***************
explain analyze INSERT INTO test_bigint select nextval('myseq') from generate_series(1,100000);
QUERY PLAN
Insert on test_bigint (cost=0.00..2250.00 rows=0 width=0) (actual time=2030.960..2030.961 rows=0 loops=1)
-> Function Scan on generate_series (cost=0.00..1250.00 rows=100000 width=8) (actual time=48.102..636.311 rows=100000 loops=1)
Planning Time: 0.065 ms
Execution Time: 2031.469 ms
explain analyze INSERT INTO test_uuid select gen_random_uuid() from generate_series(1,100000);
QUERY PLAN
Insert on test_uuid (cost=0.00..2250.00 rows=0 width=0) (actual time=10599.230..10599.230 rows=0 loops=1)
-> Function Scan on generate_series (cost=0.00..1250.00 rows=100000 width=16) (actual time=36.975..6289.811 rows=100000 loops=1)
Planning Time: 0.056 ms
Execution Time: 10599.805 ms
************ Index performance********
Explain analyze select * from test_bigint where id in (select id from test_bigint limit 100);
QUERY PLAN
Nested Loop (cost=2.98..734.71 rows=100 width=8) (actual time=0.083..0.269 rows=100 loops=1)
-> HashAggregate (cost=2.69..3.69 rows=100 width=8) (actual time=0.046..0.062 rows=100 loops=1)
Group Key: test_bigint_1.id
Batches: 1 Memory Usage: 24kB
-> Limit (cost=0.00..1.44 rows=100 width=8) (actual time=0.011..0.025 rows=100 loops=1)
-> Seq Scan on test_bigint test_bigint_1 (cost=0.00..1444.18 rows=100118 width=8) (actual time=0.011..0.017 rows=100 loops=1)
-> Index Only Scan using test_bigint_pkey on test_bigint (cost=0.29..7.31 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=100)
Index Cond: (id = test_bigint_1.id)
Heap Fetches: 100
Planning Time: 0.279 ms
Execution Time: 0.302 ms
Explain analyze select * from test_uuid where id in (select id from test_uuid limit 100);
QUERY PLAN
Nested Loop (cost=3.21..783.31 rows=100 width=16) (actual time=0.080..0.474 rows=100 loops=1)
-> HashAggregate (cost=2.79..3.79 rows=100 width=16) (actual time=0.046..0.066 rows=100 loops=1)
Group Key: test_uuid_1.id
Batches: 1 Memory Usage: 24kB
-> Limit (cost=0.00..1.54 rows=100 width=16) (actual time=0.010..0.025 rows=100 loops=1)
-> Seq Scan on test_uuid test_uuid_1 (cost=0.00..1541.85 rows=100085 width=16) (actual time=0.009..0.016 rows=100 loops=1)
-> Index Only Scan using test_uuid_pkey on test_uuid (cost=0.42..7.79 rows=1 width=16) (actual time=0.004..0.004 rows=1 loops=100)
Index Cond: (id = test_uuid_1.id)
Heap Fetches: 100
Planning Time: 0.180 ms
 Execution Time: 0.510 ms

On Tue, 31 Jan 2023 at 03:28, Ron <ronljohnsonjr@gmail.com> wrote:
> And populate that column with UUIDs generated by the gen_random_uuid() function.
> (Requires v13.)
On Wed, Feb 1, 2023 at 1:34 PM veem v <veema0000@gmail.com> wrote:
> 1) Sequence generation vs UUID generation: execution time increased from ~291ms to ~5655ms.
> 2) Insert performance of "sequence" vs "UUID": execution time increased from ~2031ms to ~10599ms.
> 3) Index performance for sequence vs UUID: execution time increased from ~0.3ms to ~0.5ms.

Yes, assuming that UUIDs would be efficient as keys when they are randomly generated, versus sequences (which tend to expand in one direction, and have been relatively optimized for years).

This article explains in detail what is going on. If I were doing this, I would strongly consider a ULID because of getting the best of both worlds.

Of course, YMMV... And since ULIDs are not native to PG, there is overhead, but it is far more performant, IMO...
On Wed, Feb 1, 2023 at 10:48 AM Kirk Wolak <wolakk@gmail.com> wrote:
> This article explains in detail what is going on. If I were doing this, I would strongly
> consider a ULID because of getting the best of both worlds.

Biased comparison. ULIDs have a timestamp component. The closest UUID equivalent in Postgres is UUIDv1 from the uuid-ossp extension, not v4. Another difference not mentioned in the blog article is that UUID is versioned, meaning you can figure out what kind of data is in the UUID, whereas ULIDs are a "one size fits all" solution.

There is an implementation of sequential UUIDs for Postgres I posted earlier in this thread.

I would suggest running your tests against v1, v7, and sequential UUID before jumping on ULID, which has no native type/indexing in Postgres.

It should also be noted that apps cannot provide a bigint ID due to collisions, but an app can generate UUIDs and ULIDs without fear, essentially shifting the generation time metric in UUID/ULID's favor over a bigserial.

- Miles

In addition, here is an implementation of UUIDv7 for Postgres:
create or replace function uuid_generate_v7()
returns uuid
as $$
declare
unix_ts_ms bytea;
uuid_bytes bytea;
begin
unix_ts_ms = substring(int8send(floor(extract(epoch from clock_timestamp()) * 1000)::bigint) from 3);
-- use random v4 uuid as starting point (which has the same variant we need)
uuid_bytes = uuid_send(gen_random_uuid());
-- overlay timestamp
uuid_bytes = overlay(uuid_bytes placing unix_ts_ms from 1 for 6);
-- set version 7
uuid_bytes = set_byte(uuid_bytes, 6, (b'0111' || get_byte(uuid_bytes, 6)::bit(4))::bit(8)::int);
return encode(uuid_bytes, 'hex')::uuid;
end
$$
language plpgsql
volatile;
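[Editor's note: a hypothetical usage sketch for the uuid_generate_v7() function above:

CREATE TABLE event (
    id         uuid PRIMARY KEY DEFAULT uuid_generate_v7(),
    created_at timestamptz NOT NULL DEFAULT now()
);

INSERT INTO event DEFAULT VALUES RETURNING id;  -- ids sort roughly by creation time
]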
*************** Postgres *****************
CREATE UNLOGGED TABLE test_bigint ( id bigint PRIMARY KEY);
CREATE UNLOGGED TABLE test_uuid ( id uuid PRIMARY KEY);
CREATE UNLOGGED TABLE test_uuid7 ( id uuid PRIMARY KEY);
create sequence myseq cache 32767;
************* sequence generation vs UUID generation Test**************
explain analyze select count(nextval('myseq') ) from generate_series(1,100000);
QUERY PLAN
Aggregate (cost=1500.00..1500.01 rows=1 width=8) (actual time=59.037..59.039 rows=1 loops=1)
-> Function Scan on generate_series (cost=0.00..1000.00 rows=100000 width=0) (actual time=18.541..34.200 rows=100000 loops=1)
Planning Time: 0.099 ms
Execution Time: 59.687 ms
explain analyze select count(gen_random_uuid()) from generate_series(1,100000);
QUERY PLAN
Aggregate (cost=1500.00..1500.01 rows=1 width=8) (actual time=900.633..900.634 rows=1 loops=1)
-> Function Scan on generate_series (cost=0.00..1000.00 rows=100000 width=0) (actual time=12.893..65.820 rows=100000 loops=1)
Planning Time: 0.051 ms
Execution Time: 904.868 ms
explain analyze select count(uuid_generate_v7()) from generate_series(1,100000);
QUERY PLAN
Aggregate (cost=26250.00..26250.01 rows=1 width=8) (actual time=1710.609..1710.610 rows=1 loops=1)
-> Function Scan on generate_series (cost=0.00..1000.00 rows=100000 width=0) (actual time=21.807..69.168 rows=100000 loops=1)
Planning Time: 0.048 ms
Execution Time: 1711.187 ms
*************** Insert Test***************
explain analyze INSERT INTO test_bigint select nextval('myseq') from generate_series(1,100000);
QUERY PLAN
Insert on test_bigint (cost=0.00..2250.00 rows=0 width=0) (actual time=551.707..551.708 rows=0 loops=1)
-> Function Scan on generate_series (cost=0.00..1250.00 rows=100000 width=8) (actual time=12.438..91.337 rows=100000 loops=1)
Planning Time: 0.053 ms
Execution Time: 552.240 ms
explain analyze INSERT INTO test_uuid select gen_random_uuid() from generate_series(1,100000);
QUERY PLAN
Insert on test_uuid (cost=0.00..2250.00 rows=0 width=0) (actual time=2040.743..2040.744 rows=0 loops=1)
-> Function Scan on generate_series (cost=0.00..1250.00 rows=100000 width=16) (actual time=12.829..982.446 rows=100000 loops=1)
Planning Time: 0.051 ms
Execution Time: 2041.242 ms
explain analyze INSERT INTO test_uuid7 select uuid_generate_v7() from generate_series(1,100000);
QUERY PLAN
Insert on test_uuid7 (cost=0.00..27000.00 rows=0 width=0) (actual time=3234.456..3234.457 rows=0 loops=1)
-> Function Scan on generate_series (cost=0.00..26000.00 rows=100000 width=16) (actual time=12.453..1933.217 rows=100000 loops=1)
Planning Time: 0.051 ms
Execution Time: 3234.971 ms
************ Index performance********
Explain analyze select * from test_bigint where id in (select id from test_bigint limit 1000);
QUERY PLAN
Nested Loop (cost=27.22..1322.70 rows=1000 width=8) (actual time=0.414..2.307 rows=1000 loops=1)
-> HashAggregate (cost=26.92..28.92 rows=200 width=8) (actual time=0.370..0.520 rows=1000 loops=1)
Group Key: test_bigint_1.id
Batches: 1 Memory Usage: 145kB
-> Limit (cost=0.00..14.42 rows=1000 width=8) (actual time=0.012..0.163 rows=1000 loops=1)
-> Seq Scan on test_bigint test_bigint_1 (cost=0.00..1444.18 rows=100118 width=8) (actual time=0.011..0.093 rows=1000 loops=1)
-> Index Only Scan using test_bigint_pkey on test_bigint (cost=0.29..6.53 rows=1 width=8) (actual time=0.001..0.001 rows=1 loops=1000)
Index Cond: (id = test_bigint_1.id)
Heap Fetches: 1000
Planning Time: 0.395 ms
Execution Time: 2.395 ms
Explain analyze select * from test_uuid where id in (select id from test_uuid limit 1000);
QUERY PLAN
Nested Loop (cost=28.32..1459.58 rows=1000 width=16) (actual time=0.431..3.355 rows=1000 loops=1)
-> HashAggregate (cost=27.91..29.91 rows=200 width=16) (actual time=0.399..0.556 rows=1000 loops=1)
Group Key: test_uuid_1.id
Batches: 1 Memory Usage: 145kB
-> Limit (cost=0.00..15.41 rows=1000 width=16) (actual time=0.011..0.185 rows=1000 loops=1)
-> Seq Scan on test_uuid test_uuid_1 (cost=0.00..1541.85 rows=100085 width=16) (actual time=0.010..0.093 rows=1000 loops=1)
-> Index Only Scan using test_uuid_pkey on test_uuid (cost=0.42..7.21 rows=1 width=16) (actual time=0.002..0.002 rows=1 loops=1000)
Index Cond: (id = test_uuid_1.id)
Heap Fetches: 1000
Planning Time: 0.234 ms
Execution Time: 3.419 ms
Explain analyze select * from test_uuid7 where id in (select id from test_uuid7 limit 1000);
QUERY PLAN
Nested Loop (cost=28.32..1416.01 rows=1000 width=16) (actual time=0.403..2.586 rows=1000 loops=1)
-> HashAggregate (cost=27.91..29.91 rows=200 width=16) (actual time=0.371..0.546 rows=1000 loops=1)
Group Key: test_uuid7_1.id
Batches: 1 Memory Usage: 145kB
-> Limit (cost=0.00..15.41 rows=1000 width=16) (actual time=0.011..0.161 rows=1000 loops=1)
-> Seq Scan on test_uuid7 test_uuid7_1 (cost=0.00..1541.85 rows=100085 width=16) (actual time=0.010..0.091 rows=1000 loops=1)
-> Index Only Scan using test_uuid7_pkey on test_uuid7 (cost=0.42..6.99 rows=1 width=16) (actual time=0.002..0.002 rows=1 loops=1000)
Index Cond: (id = test_uuid7_1.id)
Heap Fetches: 1000
Planning Time: 0.101 ms
Execution Time: 2.661 ms
Well... until two processes generate an identical UUID. That happened to me several times. It's rare, but when that happens, oh boy, that is a mess to figure out.

Thanks,
Ben

On Thu, Feb 2, 2023, 10:17 AM Miles Elam <miles.elam@productops.com> wrote:
> It should also be noted that apps cannot provide a bigint ID due to collisions, but an app
> can generate UUIDs and ULIDs without fear, essentially shifting the generation time metric
> in UUID/ULID's favor over a bigserial.
On 2023-02-02 10:22:09 -0500, Benedict Holland wrote:
> Well... until two processes generate an identical UUID. That happened to me
> several times.

How did that happen? Pure software implementation with non-random seed? Hardware with insufficient entropy source?

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"
On 2/2/23 17:11, Peter J. Holzer wrote:
> On 2023-02-02 10:22:09 -0500, Benedict Holland wrote:
>> Well... until two processes generate an identical UUID. That happened to me
>> several times.
> How did that happen? Pure software implementation with non-random seed?
> Hardware with insufficient entropy source?
Poorly implemented algorithm?
--
Born in Arizona, moved to Babylonia.
> On Feb 2, 2023, at 1:26 PM, Benedict Holland <benedict.m.holland@gmail.com> wrote:
>
> No idea at all. We had the data for the insert and had to insert it again. It was extremely confusing but oh boy did it wreck our systems.
>
> Thanks,
> Ben

Someone has a baked-in uuid in a script, I suspect.
Tested the UUIDv7 generator for postgres (results above). With regards to performance, it's still way behind the sequence. I was expecting the insert performance of UUIDv7 to be closer to the sequence, but it doesn't seem so, as it's ~500ms vs ~3000ms. And the generation takes a lot longer compared to the sequence too, i.e. ~59ms vs ~1700ms. Read time or the index scan looks close, i.e. ~2.3ms vs ~2.6ms.
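[Editor's note: part of that generation gap may be plpgsql per-call overhead rather than the UUID math itself. An untested sketch of the same logic as a single SQL-language function — uuid_generate_v7_sql is an invented name, and equivalent behavior is assumed, not verified:

CREATE OR REPLACE FUNCTION uuid_generate_v7_sql()
RETURNS uuid LANGUAGE sql VOLATILE AS $$
  -- same steps as the plpgsql version: overlay a 48-bit ms timestamp, then set version 7
  SELECT encode(set_byte(u, 6, (b'0111' || get_byte(u, 6)::bit(4))::bit(8)::int), 'hex')::uuid
  FROM (
    SELECT overlay(uuid_send(gen_random_uuid())
                   PLACING substring(int8send(floor(extract(epoch FROM clock_timestamp()) * 1000)::bigint) FROM 3)
                   FROM 1 FOR 6) AS u
  ) AS t
$$;
]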
Copying the list...

---------- Forwarded message ---------
From: Dominique Devienne <ddevienne@gmail.com>
Date: Fri, Feb 3, 2023 at 4:57 PM
Subject: Re: Sequence vs UUID
To: veem v <veema0000@gmail.com>

On Thu, Feb 2, 2023 at 8:47 PM veem v <veema0000@gmail.com> wrote:
> Tested the UUIDv7 generator for postgres as below. With regards to performance, it's still way behind the sequence. [...]
>
> explain analyze select count(nextval('myseq') ) from generate_series(1,100000);
> Execution Time: 59.687 ms
>
> explain analyze select count(gen_random_uuid()) from generate_series(1,100000);
> Execution Time: 904.868 ms
>
> explain analyze select count(uuid_generate_v7()) from generate_series(1,100000);
> Execution Time: 1711.187 ms

Something's off regarding Guid generations IMHO... You generate 100K Guids in ~1s, while we generate (in C++, Windows Release, using Boost) 16M of them in +/- the same time:

Enabling Performance tests
generate 16'000'000 guids in 0.980s (user: 0.984s) 12 MB
generate 16'000'000 guids in parallel on 4 CPUs in 0.309s (user: 1.188s) 12 MB

That's 2 orders of magnitude faster. Sure there's some overhead from the SQL, but still. Something seems fishy. And that's on a 2.5y old desktop. --DD
Actually I did the testing by connecting to "https://dbfiddle.uk/", Postgres version 15:
PostgreSQL 15.0 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-10), 64-bit

Am I doing it wrong? Please confirm.
On Fri, 3 Feb 2023 at 21:28, Dominique Devienne <ddevienne@gmail.com> wrote:
> Something's off regarding Guid generation IMHO... You generate 100K Guids in ~1s, while we generate (in C++, Windows Release, using Boost) 16M of them in +/- the same time.
On Fri, Feb 3, 2023 at 5:48 PM veem v <veema0000@gmail.com> wrote:
> Actually I did the testing by connecting to "https://dbfiddle.uk/", Postgres version 15:
> PostgreSQL 15.0 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-10), 64-bit
> Am I doing it wrong? Please confirm.

No clue. But even SQLite can generate 1M uuids (16-byte random blobs are equivalent) in 1/3 of the time, so 30x faster. So your timing of generating and counting 100K uuids seems way too slow to me. --DD

sqlite> select count(randomblob(16)) from generate_series(1,1000*1000);
QUERY PLAN
`--SCAN generate_series VIRTUAL TABLE INDEX 3:
addr opcode p1 p2 p3 p4 p5 comment
---- ------------- ---- ---- ---- ------------- -- -------------
0 Init 0 15 0 0 Start at 15
1 Null 0 1 1 0 r[1..1]=NULL
2 VOpen 0 0 0 vtab:274D3E0 0
3 Integer 1 4 0 0 r[4]=1
4 Multiply 6 6 5 0 r[5]=r[6]*r[6]
5 Integer 3 2 0 0 r[2]=3
6 Integer 2 3 0 0 r[3]=2
7 VFilter 0 11 2 0 iplan=r[2] zplan=''
8 Function 1 8 7 randomblob(1) 0 r[7]=func(r[8])
9 AggStep 0 7 1 count(1) 1 accum=r[1] step(r[7])
10 VNext 0 8 0 0
11 AggFinal 1 1 0 count(1) 0 accum=r[1] N=1
12 Copy 1 9 0 0 r[9]=r[1]
13 ResultRow 9 1 0 0 output=r[9]
14 Halt 0 0 0 0
15 Transaction 0 0 1 0 1 usesStmtJournal=0
16 Integer 1000 6 0 0 r[6]=1000
17 Integer 16 8 0 0 r[8]=16
18 Goto 0 1 0 0
┌───────────────────────┐
│ count(randomblob(16)) │
├───────────────────────┤
│ 1000000 │
└───────────────────────┘
Run Time: real 0.278 user 0.250000 sys 0.000000
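For a closer apples-to-apples check on the Postgres side (a sketch; gen_random_bytes() comes from the pgcrypto extension), one can time raw 16-byte random blobs the same way, separating the cost of the randomness itself from any uuid-type overhead:

CREATE EXTENSION IF NOT EXISTS pgcrypto;
-- counts 1M random 16-byte blobs, the direct analogue of the SQLite test above
EXPLAIN ANALYZE
SELECT count(gen_random_bytes(16)) FROM generate_series(1, 1000000);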
So it may be that the machine on which the code executes behind the scenes at "https://dbfiddle.uk/" plays a key role in the absolute speed; however, the comparative performance of UUID vs sequence should stay the same.

So I think, after this test, we can safely conclude that if we compare the performance of UUIDs (both version 4 and version 7) vs sequences, the UUID performs a lot worse. So unless there exists some strong reason/justification for UUIDs, we should use sequences by default. Correct me if I'm wrong. Also, I understand that cases like multi-master replication/sharding may be a factor, but other than that I can't think of any scenario where a sequence cannot be used.
On 2023-02-06 20:04:39 +0100, Julian Backes wrote:
> I don't really understand what you mean by 'performance'. To me it is not
> surprising that incrementing (I know it is not just incrementing) a
> 64bit integer is faster than generating 128 bit data with a good amount of
> random data, even if it seems to be too slow.

That's not really the problem with UUIDs, though. My (not very fast) laptop can call getrandom() 1 million times per second (in a single thread). Avoiding the system call[1] could make this fast enough to be completely negligible compared to the time of writing a row to disk.

But UUIDs are random, and that plays havoc with locality. For example, consider one table with invoices and another with invoice items. If you want to get all the invoices including the items of a single day, the data is probably nicely clustered together in the tables. But the join needs to look up random ids in the index, which will be spread all over the index. In a simple benchmark for this scenario the UUIDs were about 4.5 times slower than sequential ids. (In other benchmarks the difference was only a few percent.)

So depending on the queries, the difference may be negligible or huge. It really depends on your access patterns.

        hp

[1] There was even a discussion about making that much faster on the LKML recently.

--
Peter J. Holzer | hjp@hjp.at | http://www.hjp.at/
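To make that scenario concrete, here is a hedged sketch of the invoice/item shape described above (all names illustrative): with sequential ids, one day's invoices occupy a narrow, cache-friendly range of the invoice pkey index, while random uuids scatter the same lookups across the whole index.

CREATE TABLE invoice (
    id  bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    day date NOT NULL
);
CREATE TABLE invoice_item (
    invoice_id bigint NOT NULL REFERENCES invoice(id),
    amount     numeric NOT NULL
);

-- the join whose index lookups stay local with sequential ids,
-- but spread over the whole pkey index with random uuids:
SELECT i.id, sum(it.amount)
FROM invoice i
JOIN invoice_item it ON it.invoice_id = i.id
WHERE i.day = current_date
GROUP BY i.id;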
On Mon, Feb 6, 2023 at 1:22 PM Peter J. Holzer <hjp-pgsql@hjp.at> wrote:
> But UUIDs are random and that plays havoc with locality.

This is really key. [...] The databases I've seen that are written with the UUID pattern appear to be written by developers oblivious to this fact.
On Tue, Feb 7, 2023 at 3:47 PM Merlin Moncure <mmoncure@gmail.com> wrote:
> This is really key. [...] The databases I've seen that are written with the UUID pattern appear to be written by developers oblivious to this fact.

Well, perhaps these developers are not dealing with temporally clustered data, like commerce-related DBs, but with more scientific data? In any case, this developer will definitely investigate ULIDs vs UUIDs vs ints (sequences), based on the info from this thread. I'm aware of fragmentation issues and cache hit/miss issues etc. in general, but was probably not sufficiently making the mental connection with UUIDs and PostgreSQL. So thanks, everyone.
Thank you so much, all, for such valuable feedback. So I wanted to know from the experts here: does there really exist any scenario in which a UUID really can't be avoided?
-- Requires the pgcrypto extension for gen_random_bytes().
CREATE FUNCTION generate_ulid() RETURNS uuid
    LANGUAGE sql
    -- 14 hex digits (56 bits) of microsecond epoch timestamp,
    -- followed by 18 hex digits (72 bits) of randomness:
    RETURN ((lpad(to_hex((floor((EXTRACT(epoch FROM clock_timestamp()) * (1000000)::numeric)))::bigint), 14, '0'::text)
    || encode(gen_random_bytes(9), 'hex'::text)))::uuid;
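Assuming (my inference, not stated above) that this relies on pgcrypto for gen_random_bytes(), usage would look like:

CREATE EXTENSION IF NOT EXISTS pgcrypto;
SELECT generate_ulid();
-- the leading 14 hex digits encode the microsecond epoch timestamp,
-- so values generated later sort after earlier ones in an index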
On Wed, Feb 8, 2023 at 11:56 AM Kirk Wolak <wolakk@gmail.com> wrote:
>
> CREATE FUNCTION generate_ulid() RETURNS uuid
> LANGUAGE sql
> RETURN ((lpad(to_hex((floor((EXTRACT(epoch FROM clock_timestamp()) * (1000000)::numeric)))::bigint), 14, '0'::text)
> || encode(gen_random_bytes(9), 'hex'::text)))::uuid;

You can save yourself some CPU by skipping the extra cast, omitting the lpad, to_hex, and floor, and just grabbing the bytes from the bigint directly along with the random part, since bytea values can be concatenated:

SELECT encode(int8send((EXTRACT(epoch FROM clock_timestamp()) * 1000000)::bigint) || gen_random_bytes(8), 'hex')::uuid;

Note that you refer to it as a ULID, but it is stored as a UUID. Hopefully nothing downstream ever relies on UUID versioning/spec compliance.

Now that I think of it, I could probably speed up my tagged interval UUID implementation using some of this, at the expense of configurability.
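An illustrative check of the time-ordering this gives: clock_timestamp() advances during a single statement, so successive rows should typically produce increasing uuids, since the microsecond timestamp occupies the most significant bytes.

SELECT encode(int8send((EXTRACT(epoch FROM clock_timestamp()) * 1000000)::bigint)
              || gen_random_bytes(8), 'hex')::uuid
FROM generate_series(1, 3);
-- expect the three uuids to compare in generation order
-- (barring two calls landing on the same microsecond)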
On 2023-02-08 14:48:03 +0530, veem v wrote:
> So wanted to know from experts here, is there really exists any scenario in
> which UUID really cant be avoided?
Probably not. The question is usually not "is this possible" but "does
this meet the requirements at acceptable cost".
> Sequence Number = n*d + m + offset, where n is the sequence order number, d is the
> number of nodes in the multi-master replication, m (ranging from 0 to d-1) is the
> number assigned to each node in the replication, and offset is the number used to
> offset the sequence numbers.
Yes, you can do this. In fact, people (including me) have already done
this.
But it's relatively easy to mess this up:
Firstly, you have to make sure that d is larger than your number of
(active) replicas will ever be, but still small enough that you will
never overflow. Probably not a problem with 64 bit sequences (if you set
d to 1E6, you can still count to 9E12 on each node), but might be a
problem if you are for some reason limited to 32 bits.
Secondly (and IMHO more importantly) you have to make sure each node
gets its own unique offset. So this needs to be ensured during
deployment, but also during migrations, restores from backups and other
infrequent events.

With random ids you don't have to worry about this.

        hp
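As a concrete sketch of that scheme (sequence name and numbers illustrative): rather than computing n*d + m + offset in the application, each node can bake m and d into its own sequence. With d = 1000 possible nodes, node m = 2 could run:

CREATE SEQUENCE order_id_seq START WITH 3 INCREMENT BY 1000;
-- yields 3, 1003, 2003, ...: every value equals 3 mod 1000,
-- disjoint from the values generated on any other node
SELECT nextval('order_id_seq');

The deployment caveat above still applies: each node must get a distinct START WITH, and that uniqueness has to survive migrations and restores from backups.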